users_and_tweets_data_based_10000_model_1_with_standardization_and_all_tweets_of_user.ipynb
[2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from pandas.api.types import is_numeric_dtype
from datetime import datetime

[3]:
!pip install tensorflow

Requirement already satisfied: tensorflow in /home/jupyter/.local/lib/python3.7/site-packages (2.11.0)
(... dependency resolution log truncated ...)
Successfully installed flatbuffers-23.5.26 libclang-16.0.6 tensorboard-data-server-0.6.1 tensorboard-plugin-wit-1.8.1 tensorflow-estimator-2.11.0 tensorflow-io-gcs-filesystem-0.33.0 termcolor-2.3.0 werkzeug-2.2.3
[4]:
import tensorflow as tf

2023-09-03 23:40:38.987825: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA. To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-09-03 23:40:53.885297: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-09-03 23:40:53.887013: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2023-09-03 23:40:53.887041: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
[5]:
pd.options.mode.chained_assignment = None

[6]:
!pip install keras

Requirement already satisfied: keras in /home/jupyter/.local/lib/python3.7/site-packages (2.11.0)
[7]:
!pip install scikeras

Collecting scikeras
  Using cached scikeras-0.10.0-py3-none-any.whl (27 kB)
(... dependency resolution log truncated ...)
Successfully installed scikeras-0.10.0
[8]:
import keras
from keras.models import Sequential, Model
from keras.layers import Input, Dense, Activation, Dropout, Flatten, Embedding, LSTM, Concatenate, Reshape, Bidirectional, SimpleRNN
from keras.layers.convolutional import Conv1D, Conv2D, MaxPooling1D, MaxPooling2D
from keras.callbacks import ModelCheckpoint, EarlyStopping

[9]:
import sklearn
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score

[10]:
!pip install livelossplot
from livelossplot.tf_keras import PlotLossesCallback

Collecting livelossplot
  Using cached livelossplot-0.5.5-py3-none-any.whl (22 kB)
(... dependency resolution log truncated ...)
Successfully installed bokeh-2.4.3 livelossplot-0.5.5
[11]:
!pip install shap
import shap

Collecting shap
  Using cached shap-0.42.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (545 kB)
(... dependency resolution log truncated ...)
Successfully installed shap-0.42.1 slicer-0.0.7
Using `tqdm.autonotebook.tqdm` in notebook mode. Use `tqdm.tqdm` instead to force console mode (e.g. in jupyter console)
[12]:
from google.cloud import bigquery

Authentication for working from Google Colab¶
[13]:
# from google.colab import auth
# auth.authenticate_user()

Destination to save trained models¶
Google Drive¶
[14]:
# from google.colab import drive
# drive.mount('/content/gdrive')
# project_folder_path = '/content/gdrive/Shareddrives/Magisterka/PROJEKT/'
# models_path = project_folder_path + '/models'

Vertex AI Jupyter Lab¶
[15]:
data_analysis_folder_path = '../'
models_path = data_analysis_folder_path + '/models'

Connect to BigQuery service¶
[16]:
import sys
sys.path.append("./../../")
from gcp_env import PROJECT_ID, LOCATION

[17]:
project_id = PROJECT_ID  # Fill project id
bqclient = bigquery.Client(location=LOCATION, project=project_id)

Users data¶
Loading data¶
[18]:
dataset_name = "twitbot_22_preprocessed_common_users_ids"
users_table_name = "users"
BQ_TABLE_USERS = dataset_name + "." + users_table_name
users_table_id = project_id + "." + BQ_TABLE_USERS

[19]:
# job_config = bigquery.QueryJobConfig(
#     allow_large_results=True, destination=users_table_id, use_legacy_sql=True
# )

[20]:
SQL_QUERY = f"""
WITH human_records AS (
    SELECT *, ROW_NUMBER() OVER () row_num
    FROM {BQ_TABLE_USERS}
    WHERE label = 'human'
    LIMIT 5000
),
bot_records AS (
    SELECT *, ROW_NUMBER() OVER () row_num
    FROM {BQ_TABLE_USERS}
    WHERE label = 'bot'
    LIMIT 5000
)
SELECT * FROM human_records
UNION ALL
SELECT * FROM bot_records
ORDER BY row_num;
"""
users_df1 = bqclient.query(SQL_QUERY).to_dataframe()
users_df1 = users_df1.drop(['row_num'], axis=1)

[21]:
# LIMIT RESULTS OPTIONS
pd.set_option('display.max_rows', 100)
# pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

[22]:
num_bots = len(users_df1.loc[users_df1['label']=='bot'])      # bots number
num_humans = len(users_df1.loc[users_df1['label']=='human'])  # humans number
print("Number of real users: ", num_humans)
print("Number of bots: ", num_bots)

Number of real users:  5000
Number of bots:  5000
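For reference, the balancing done by the two LIMIT 5000 clauses in the query above can also be sketched client-side in pandas. The helper below is hypothetical (not from the notebook) and runs on toy data, not the real table:

```python
import pandas as pd

# Hypothetical helper: cap every class at the size of the smallest class,
# mirroring the per-label LIMIT clauses of the BigQuery query above.
def balance_by_label(df, label_col="label"):
    cap = df[label_col].value_counts().min()  # size of the smaller class
    # groupby().head(cap) keeps the first `cap` rows of each label,
    # preserving the original row order.
    return df.groupby(label_col).head(cap).reset_index(drop=True)

toy = pd.DataFrame({"label": ["bot"] * 6 + ["human"] * 4})
print(balance_by_label(toy)["label"].value_counts().to_dict())  # both classes capped at 4
```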
[23]:
org_users_df = pd.DataFrame(users_df1).copy()
users_df2 = pd.DataFrame(org_users_df).copy()

[24]:
def filter_df_for_balanced_classes(df, bot_label_value='bot', human_label_value='human'):
    new_df = pd.DataFrame()
    i = 0  # bots iter.
    j = 0  # humans iter.
    k = 0
    num_bots = len(df.loc[df['label']==bot_label_value])
    num_humans = len(df.loc[df['label']==human_label_value])
    max_num = min(num_bots, num_humans)
    for index, record in df.iterrows():
        if k < (2*max_num):
            if record['label']==bot_label_value and i < max_num:
                new_df = new_df.append(record)
                # users_df = pd.concat([users_df, record], ignore_index=True)
                i += 1
                k += 1
            if record['label']==human_label_value and j < max_num:
                new_df = new_df.append(record)
                # users_df = pd.concat([users_df, record], ignore_index=True)
                j += 1
                k += 1
    print("Number of bots: ", len(new_df.loc[new_df['label']==bot_label_value]))
    print("Number of human users: ", len(new_df.loc[new_df['label']==human_label_value]))
    return pd.DataFrame(new_df).copy()

[25]:
# users_df = filter_df_for_balanced_classes(users_df2)
users_df = pd.DataFrame(users_df2).copy()

Data preparation¶
[26]:
def drop_columns(df, columns):
    for column_name in columns:
        df = df.drop([column_name], axis=1)
    return df

[27]:
def encode_not_numeric_columns(df):
    for column_name in df:
        if not is_numeric_dtype(df[column_name]):
            unique_values_dict = dict(enumerate(df[column_name].unique()))
            unique_values_dict = dict((v, k) for k, v in unique_values_dict.items())
            df[column_name] = df[column_name].map(unique_values_dict)
    return df

Align values for bool columns¶
[28]:
boolean_columns = ["verified", "protected", "withheld", "has_location", "has_profile_image_url", "has_pinned_tweet", "has_description"]

[29]:
# First, align the boolean columns' values
for col_name in boolean_columns:
    users_df[col_name] = users_df[col_name].astype(bool)

column_to_remove = []
# Check unique values; a column with only one unique value in this subset
# carries no signal, so it will be removed from the dataframe
for col_name in boolean_columns:
    uniq_val_list = users_df[col_name].unique()
    print("Column {:<24} {}".format(col_name, str(uniq_val_list)))
    if (len(uniq_val_list) < 2):
        column_to_remove.append(col_name)

Column verified                 [False  True]
Column protected                [False  True]
Column withheld                 [False]
Column has_location             [ True False]
Column has_profile_image_url    [ True False]
Column has_pinned_tweet         [False  True]
Column has_description          [ True False]
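The single-valued-column check above can also be expressed with `DataFrame.nunique`; a small sketch on a toy frame (made-up values that mirror the output above, where only `withheld` is constant):

```python
import pandas as pd

# Toy stand-in for users_df's boolean columns; 'withheld' has a single
# unique value, so it is flagged for removal, as in the cell above.
toy = pd.DataFrame({
    "verified": [False, True, False],
    "withheld": [False, False, False],
    "has_location": [True, False, True],
})
column_to_remove = [c for c in toy.columns if toy[c].nunique() < 2]
print(column_to_remove)  # ['withheld']
```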
[30]:
column_to_remove

[30]:
['withheld']
[31]:
# remove from bool columns:
for col_name in column_to_remove:
    boolean_columns.remove(col_name)
# remove from dataframe
users_df = drop_columns(users_df, column_to_remove)

Encoding of non-numeric information which will be used by the model¶
[32]:
# Remap the values of the dataframe
for col_name in boolean_columns:
    users_df[col_name] = users_df[col_name].map({True: 1, False: 0})

# Remap label values human/bot to 0/1
label_col = "label"
users_df[label_col] = users_df[label_col].map({"human": 0, "bot": 1})

[33]:
users_df

[33]:
|   | id | label | username | name | created_at | verified | protected | has_location | location | has_profile_image_url | has_pinned_tweet | url | followers_count | following_count | tweet_count | listed_count | has_description | description | descr_no_hashtags | descr_no_cashtags | descr_no_mentions | descr_no_urls | url_no_urls |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1428769922507751429 | 1 | BotoxAesthetics | dermalfillers Aesthetics botox | 1629480285 | 0 | 0 | 1 | London , United Kingdom | 1 | 0 | https://t.co/CBDBvXnRKv | 2 | 41 | 1 | 0 | 1 | Enhance fillers is a progressive company found in the city of Webminster,We offer a wide range of aesthetic services including Botox, Dysport, Xeomin,the Juvede | 0 | 0 | 0 | 0 | 1 |
| 1 | 1484544053572419585 | 0 | blessing_xettry | #Blessing xettry | 1642777877 | 0 | 0 | 1 | Nepal | 1 | 0 | 0 | 24 | 1 | 0 | 1 | Okay, well, maybe not forever. But at least until you make some changes. | 0 | 0 | 0 | 0 | 0 | |
| 2 | 842202106324951040 | 1 | Mark11474609 | Mark | 1489631604 | 0 | 0 | 1 | Kelvin Grove, Brisbane | 1 | 0 | 3 | 22 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ||
| 3 | 1447956502443069446 | 0 | menametaken | winwinnie | 1634054741 | 0 | 0 | 1 | your walls | 1 | 0 | 0 | 20 | 1 | 0 | 1 | 20 | uni student | life goes brrr \nyes I do and it's called art\n#thickthighssavelifes | 1 | 0 | 0 | 0 | 0 | |
| 4 | 21309002 | 1 | Sjouzan | Zuzana | 1235058272 | 0 | 0 | 1 | Brighton, UK | 1 | 0 | 3 | 42 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ||
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | 3275187061 | 0 | LZconcussion | Concussion Recovery | 1436585642 | 0 | 0 | 1 | Park City, UT | 1 | 1 | https://t.co/KpjP54TOGR | 352 | 190 | 1094 | 18 | 1 | Specializing in #concussionmanagement including #education, #therapy options / recommendations & our standardized #ReturntoLifeandSport #exerciseprogression | 5 | 0 | 0 | 0 | 1 |
| 9996 | 1485289449487572996 | 1 | davie73smith | Davie | 1642955586 | 0 | 0 | 0 | None | 1 | 0 | 0 | 34 | 0 | 0 | 1 | F U N | 0 | 0 | 0 | 0 | 0 | |
| 9997 | 1215382704876871680 | 0 | USC_TrueVote | USC Election Cybersecurity Initiative | 1578604840 | 0 | 0 | 1 | washington dc | 1 | 0 | https://t.co/jlreKFwVEc | 608 | 1265 | 1095 | 11 | 1 | Platform, party, and vendor-agnostic.\nOur candidate is democracy. ����\nTraining in all 50 states ✈ �� ��\nSupport from @google\nUpcoming Training Events �� | 0 | 0 | 1 | 0 | 1 |
| 9998 | 1480725883820208131 | 1 | Theresa823 | Theresa Coleman | 1641867569 | 0 | 0 | 0 | None | 1 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ||
| 9999 | 407458156 | 0 | ManyDullKnives | Many Dull Knives | 1320722300 | 0 | 0 | 1 | Toronto, Canada. | 1 | 1 | http://t.co/JyuusxBRDF | 864 | 346 | 153532 | 47 | 1 | The official Twitter account of the webcomic, Many Dull Knives, by @jHYtse. (This is a humourous strip and not about a cowardly cutter) | 0 | 0 | 1 | 0 | 1 |
10000 rows × 23 columns
Null and NaN statistics¶
[34]:
for col_name in users_df:
    count1 = pd.isnull(users_df[col_name]).sum()
    print(col_name + ": " + str(count1))

id: 0
label: 0
username: 0
name: 0
created_at: 0
verified: 0
protected: 0
has_location: 0
location: 3476
has_profile_image_url: 0
has_pinned_tweet: 0
url: 0
followers_count: 0
following_count: 0
tweet_count: 0
listed_count: 0
has_description: 0
description: 0
descr_no_hashtags: 0
descr_no_cashtags: 0
descr_no_mentions: 0
descr_no_urls: 0
url_no_urls: 0
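The loop above is equivalent to a single `isnull().sum()` call; a toy sketch (assumed values, echoing that only `location` contains nulls):

```python
import pandas as pd

# Toy stand-in for users_df: isnull().sum() gives the same per-column null
# counts as the explicit loop above, in one vectorized call.
toy = pd.DataFrame({"id": [1, 2, 3], "location": ["Nepal", None, None]})
print(toy.isnull().sum().to_dict())  # {'id': 0, 'location': 2}
```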
Extract some information from the dataframe into new columns¶
Description length¶
[35]:
users_df['descr_len'] = users_df['description'].apply(len).astype(float)

Account age in days, since 16.03.2022 (dataset collected during the 20/01-15/03/2022 period)¶
[36]:
from datetime import datetime

[37]:
def cal_days_diff(a, b):
    A = a.replace(hour=0, minute=0, second=0, microsecond=0)
    B = b.replace(hour=0, minute=0, second=0, microsecond=0)
    return (A - B).days

def convert_unixtime_to_datetime(a):
    return datetime.utcfromtimestamp(a)

[38]:
base_date = datetime(2022, 3, 16)
users_df['account_age'] = users_df.apply(lambda x: cal_days_diff(base_date, convert_unixtime_to_datetime(x.created_at)), axis=1).astype(float)

Reduce unnecessary columns¶
[39]:
# users_reduced_df = pd.DataFrame(users_df).copy()
# # columns_to_drop = ["id", "username", "name", "created_at", "location", "url", "description"]
# columns_to_drop = ["username", "name", "created_at", "location", "url", "description"]
# users_reduced_df = drop_columns(users_reduced_df, columns_to_drop)
# users_reduced_df

Filter data, keeping columns by feature importance based on SHAP results¶
[40]:
shap_features = ['followers_count', 'tweet_count', 'following_count', 'account_age', 'descr_len']

[41]:
users_reduced_df = users_df.copy()
users_reduced_df = users_df.filter(['label', 'id'] + shap_features)
users_reduced_df

[41]:
|   | label | id | followers_count | tweet_count | following_count | account_age | descr_len |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1428769922507751429 | 2 | 1 | 41 | 208.0 | 160.0 |
| 1 | 0 | 1484544053572419585 | 0 | 1 | 24 | 54.0 | 72.0 |
| 2 | 1 | 842202106324951040 | 3 | 4 | 22 | 1826.0 | 0.0 |
| 3 | 0 | 1447956502443069446 | 0 | 1 | 20 | 155.0 | 85.0 |
| 4 | 1 | 21309002 | 3 | 2 | 42 | 4773.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | 0 | 3275187061 | 352 | 1094 | 190 | 2440.0 | 156.0 |
| 9996 | 1 | 1485289449487572996 | 0 | 0 | 34 | 52.0 | 5.0 |
| 9997 | 0 | 1215382704876871680 | 608 | 1095 | 1265 | 797.0 | 153.0 |
| 9998 | 1 | 1480725883820208131 | 0 | 0 | 5 | 64.0 | 0.0 |
| 9999 | 0 | 407458156 | 864 | 153532 | 346 | 3781.0 | 135.0 |
10000 rows × 7 columns
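As a sanity check of the `account_age` column above: re-running the notebook's date helpers on created_at 1642777877 (the user in the second row) reproduces its account_age of 54 days:

```python
from datetime import datetime

# Same helpers as defined in the notebook above.
def cal_days_diff(a, b):
    A = a.replace(hour=0, minute=0, second=0, microsecond=0)
    B = b.replace(hour=0, minute=0, second=0, microsecond=0)
    return (A - B).days

def convert_unixtime_to_datetime(ts):
    return datetime.utcfromtimestamp(ts)

base_date = datetime(2022, 3, 16)
# created_at 1642777877 is 2022-01-21 UTC, 54 days before the base date.
print(cal_days_diff(base_date, convert_unixtime_to_datetime(1642777877)))  # 54
```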
Data type conversion (to float)¶
[42]:
for (column_name, column_data) in users_reduced_df.iteritems():
    if (column_name != 'id'):
        users_reduced_df[column_name] = users_reduced_df[column_name].astype(float)

Data split for training, validation and testing of users data¶
[43]:
train_users_data, test_users_data = train_test_split(users_reduced_df, test_size=0.30, random_state=25, shuffle=True)
test_users_data, val_users_data = train_test_split(test_users_data, test_size=0.5, random_state=25, shuffle=True)

Describe the training subset of the users dataset¶
[44]:
train_users_data.describe()

[44]:
|   | label | followers_count | tweet_count | following_count | account_age | descr_len |
|---|---|---|---|---|---|---|
| count | 7000.000000 | 7.000000e+03 | 7.000000e+03 | 7000.000000 | 7000.000000 | 7000.000000 |
| mean | 0.503857 | 6.229971e+03 | 6.554910e+03 | 1253.036286 | 2442.663000 | 84.592000 |
| std | 0.500021 | 4.412925e+04 | 3.316229e+04 | 6121.951212 | 1640.465006 | 59.651674 |
| min | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 22.000000 | 0.000000 |
| 25% | 0.000000 | 3.300000e+01 | 2.200000e+01 | 74.000000 | 818.000000 | 23.000000 |
| 50% | 1.000000 | 2.710000e+02 | 5.015000e+02 | 269.000000 | 2407.000000 | 95.000000 |
| 75% | 1.000000 | 1.565500e+03 | 3.310500e+03 | 899.000000 | 3995.000000 | 143.000000 |
| max | 1.000000 | 1.730667e+06 | 1.184641e+06 | 244195.000000 | 5724.000000 | 243.000000 |
Describe training users data for bots¶
[45]:
train_users_data.loc[train_users_data['label']==1].describe()
[45]:
| | label | followers_count | tweet_count | following_count | account_age | descr_len |
|---|---|---|---|---|---|---|
| count | 3527.0 | 3527.000000 | 3527.000000 | 3527.000000 | 3527.000000 | 3527.000000 |
| mean | 1.0 | 2016.999716 | 2185.104338 | 770.499008 | 2060.185143 | 67.609300 |
| std | 0.0 | 19503.794857 | 11279.654017 | 4195.024713 | 1565.122289 | 62.244412 |
| min | 1.0 | 0.000000 | 0.000000 | 0.000000 | 30.000000 | 0.000000 |
| 25% | 1.0 | 14.000000 | 7.000000 | 41.000000 | 604.000000 | 0.000000 |
| 50% | 1.0 | 81.000000 | 127.000000 | 140.000000 | 1776.000000 | 58.000000 |
| 75% | 1.0 | 410.000000 | 1086.000000 | 431.000000 | 3446.500000 | 134.000000 |
| max | 1.0 | 702018.000000 | 497641.000000 | 150720.000000 | 5484.000000 | 243.000000 |
Describe training users data for humans¶
[46]:
train_users_data.loc[train_users_data['label']==0].describe()
[46]:
| | label | followers_count | tweet_count | following_count | account_age | descr_len |
|---|---|---|---|---|---|---|
| count | 3473.0 | 3.473000e+03 | 3.473000e+03 | 3473.000000 | 3473.000000 | 3473.000000 |
| mean | 0.0 | 1.050845e+04 | 1.099266e+04 | 1743.076303 | 2831.087820 | 101.838756 |
| std | 0.0 | 5.918594e+04 | 4.526135e+04 | 7563.173323 | 1624.084521 | 51.457456 |
| min | 0.0 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 22.000000 | 0.000000 |
| 25% | 0.0 | 1.570000e+02 | 1.700000e+02 | 159.000000 | 1343.000000 | 63.000000 |
| 50% | 0.0 | 9.130000e+02 | 1.578000e+03 | 499.000000 | 3094.000000 | 115.000000 |
| 75% | 0.0 | 3.610000e+03 | 6.601000e+03 | 1413.000000 | 4342.000000 | 149.000000 |
| max | 0.0 | 1.730667e+06 | 1.184641e+06 | 244195.000000 | 5724.000000 | 181.000000 |
Data analysis¶
Distribution of label class in training, validation and test set of users data¶
[47]:
stack_data = {'Set': ['Training data', 'Validation data', 'Test data', 'Training data', 'Validation data', 'Test data'],
              'Label': ['Bot', 'Bot', 'Bot', 'Human', 'Human', 'Human'],
              'Freq': [len(train_users_data.loc[train_users_data['label']==1]),
                       len(val_users_data.loc[val_users_data['label']==1]),
                       len(test_users_data.loc[test_users_data['label']==1]),
                       len(train_users_data.loc[train_users_data['label']==0]),
                       len(val_users_data.loc[val_users_data['label']==0]),
                       len(test_users_data.loc[test_users_data['label']==0])]}
sdf = pd.DataFrame(stack_data)
sdf
[47]:
| | Set | Label | Freq |
|---|---|---|---|
| 0 | Training data | Bot | 3527 |
| 1 | Validation data | Bot | 743 |
| 2 | Test data | Bot | 730 |
| 3 | Training data | Human | 3473 |
| 4 | Validation data | Human | 757 |
| 5 | Test data | Human | 770 |
[48]:
fig = px.bar(sdf, x="Set", y="Freq", color="Label", hover_data=['Label'], barmode='group')
fig.update_layout(
    title_text='Distribution of bot/human classes in training, validation and test dataset',
    xaxis_title_text='',
    yaxis_title_text='frequency',
    bargap=0.05, bargroupgap=0.05,
    width=700, height=500,
    legend={"title": ""})
fig.show()
Distribution of other features in training dataset¶
followers_count¶
[49]:
fig = make_subplots(rows=1, cols=2, specs=[[{'type': 'histogram'}, {'type': 'histogram'}]])
fig.add_trace(go.Histogram(
    x=train_users_data.loc[train_users_data['label']==1, 'followers_count'],
    nbinsx=200, name='Bot'), row=1, col=1)
fig.add_trace(go.Histogram(
    x=train_users_data.loc[train_users_data['label']==0, 'followers_count'],
    nbinsx=200, name='Human'), row=1, col=2)
fig.update_layout(
    title_text='Distribution of values of training dataset column: followers_count',
    xaxis_title_text='followers_count',
    yaxis_title_text='frequency',
    bargap=0.5,
    width=1100, height=450,
    legend={"title": ""},
    xaxis=dict(showgrid=True, title='followers_count', dtick=50000,
               range=[0, max(train_users_data.loc[train_users_data['label']==1, 'followers_count']) + 25000]),
    xaxis2=dict(showgrid=True, dtick=100000,
                range=[0, max(train_users_data.loc[train_users_data['label']==0, 'followers_count']) + 50000]),
    yaxis=dict(showgrid=True))
fig.show()
[50]:
len(train_users_data[(train_users_data['label']==1)])
[50]:
3527
[51]:
len(train_users_data[(train_users_data['label']==0)])
[51]:
3473
[52]:
from scipy.stats import expon
# Fit an exponential distribution to data
loc_b, scale_b = expon.fit(train_users_data.loc[train_users_data['label']==1]['followers_count'])
loc_h, scale_h = expon.fit(train_users_data.loc[train_users_data['label']==0]['followers_count'])
# Calculate the 99th percentile using the percent-point function (inverse CDF)
percentile_99_bots = expon.ppf(0.99, loc=loc_b, scale=scale_b)
percentile_99_humans = expon.ppf(0.99, loc=loc_h, scale=scale_h)
df_reduced_outliers_followers_count = train_users_data[((train_users_data['label']==1) & (train_users_data['followers_count'] < percentile_99_bots)) |
                                                       ((train_users_data['label']==0) & (train_users_data['followers_count'] < percentile_99_humans))]
df_filtered_bots = train_users_data[(train_users_data['label']==1) & (train_users_data['followers_count'] < percentile_99_bots)]
df_filtered_humans = train_users_data[(train_users_data['label']==0) & (train_users_data['followers_count'] < percentile_99_humans)]
[53]:
def df_99_percentile(df, column_name):
    # Fit an exponential distribution to data
    loc_b, scale_b = expon.fit(df.loc[df['label']==1][column_name])
    loc_h, scale_h = expon.fit(df.loc[df['label']==0][column_name])
    # Calculate the 99th percentile using the percent-point function (inverse CDF)
    percentile_99_bots = expon.ppf(0.99, loc=loc_b, scale=scale_b)
    percentile_99_humans = expon.ppf(0.99, loc=loc_h, scale=scale_h)
    return df[((df['label']==1) & (df[column_name] < percentile_99_bots)) |
              ((df['label']==0) & (df[column_name] < percentile_99_humans))]
following_count¶
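To sanity-check `df_99_percentile`, the helper can be run on synthetic exponential data (the function is repeated here so the snippet is self-contained; the toy frame and seed are assumptions): rows above the per-class fitted 99th percentile are dropped.

```python
import numpy as np
import pandas as pd
from scipy.stats import expon

def df_99_percentile(df, column_name):
    # Fit an exponential distribution per class, then keep rows below its 99th percentile
    loc_b, scale_b = expon.fit(df.loc[df['label'] == 1][column_name])
    loc_h, scale_h = expon.fit(df.loc[df['label'] == 0][column_name])
    percentile_99_bots = expon.ppf(0.99, loc=loc_b, scale=scale_b)
    percentile_99_humans = expon.ppf(0.99, loc=loc_h, scale=scale_h)
    return df[((df['label'] == 1) & (df[column_name] < percentile_99_bots)) |
              ((df['label'] == 0) & (df[column_name] < percentile_99_humans))]

rng = np.random.default_rng(0)
toy = pd.DataFrame({'label': rng.integers(0, 2, 1000),
                    'followers_count': rng.exponential(100.0, 1000)})
reduced = df_99_percentile(toy, 'followers_count')
print(len(toy), len(reduced))
```

On heavy-tailed count data like `followers_count`, this keeps roughly 99% of each class and removes the extreme tail.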
[54]:
fig = make_subplots(rows=1, cols=2, specs=[[{'type': 'histogram'}, {'type': 'histogram'}]])
fig.add_trace(go.Histogram(
    x=train_users_data.loc[train_users_data['label']==1, 'following_count'],
    nbinsx=200, name='Bot'), row=1, col=1)
fig.add_trace(go.Histogram(
    x=train_users_data.loc[train_users_data['label']==0, 'following_count'],
    nbinsx=200, name='Human'), row=1, col=2)
fig.update_layout(
    title_text='Distribution of values of training dataset column: following_count',
    xaxis_title_text='following_count',
    yaxis_title_text='frequency',
    bargap=0.5,
    width=1100, height=450,
    legend={"title": ""},
    xaxis=dict(showgrid=True, dtick=10000, range=[0, max(train_users_data['following_count']) + 5000]),
    xaxis2=dict(showgrid=True, dtick=10000, range=[0, max(train_users_data['following_count']) + 5000]),
    yaxis=dict(showgrid=True))
fig.show()
tweet_count¶
[55]:
fig = make_subplots(rows=1, cols=2, specs=[[{'type': 'histogram'}, {'type': 'histogram'}]])
fig.add_trace(go.Histogram(
    x=train_users_data.loc[train_users_data['label']==1, 'tweet_count'],
    nbinsx=200, name='Bot'), row=1, col=1)
fig.add_trace(go.Histogram(
    x=train_users_data.loc[train_users_data['label']==0, 'tweet_count'],
    nbinsx=200, name='Human'), row=1, col=2)
fig.update_layout(
    title_text='Distribution of values of training dataset column: tweet_count',
    xaxis_title_text='tweet_count',
    yaxis_title_text='frequency',
    bargap=0.5,
    width=1100, height=450,
    legend={"title": ""},
    xaxis=dict(showgrid=True, dtick=20000),
    xaxis2=dict(showgrid=True, dtick=100000),
    yaxis=dict(showgrid=True))
fig.show()
descr_len¶
[56]:
fig = make_subplots(rows=1, cols=2, specs=[[{'type': 'histogram'}, {'type': 'histogram'}]])
fig.add_trace(go.Histogram(
    x=train_users_data.loc[train_users_data['label']==1, 'descr_len'],
    name='Bot'), row=1, col=1)
fig.add_trace(go.Histogram(
    x=train_users_data.loc[train_users_data['label']==0, 'descr_len'],
    name='Human'), row=1, col=2)
fig.update_layout(
    title_text='Distribution of values of training dataset column: descr_len',
    xaxis_title_text='descr_len',
    yaxis_title_text='frequency',
    bargap=0.2,
    width=1100, height=350,
    legend={"title": ""},
    xaxis=dict(showgrid=True, dtick=10, range=[0, max(train_users_data['descr_len']) + 5]),
    xaxis2=dict(showgrid=True, dtick=10, range=[0, max(train_users_data['descr_len']) + 5]),
    yaxis=dict(showgrid=True))
fig.show()
account_age¶
[57]:
fig = make_subplots(rows=1, cols=2, specs=[[{'type': 'histogram'}, {'type': 'histogram'}]])
fig.add_trace(go.Histogram(
    x=train_users_data.loc[train_users_data['label']==1, 'account_age'],
    name='Bot'), row=1, col=1)
fig.add_trace(go.Histogram(
    x=train_users_data.loc[train_users_data['label']==0, 'account_age'],
    name='Human'), row=1, col=2)
fig.update_layout(
    title_text='Distribution of values of training dataset column: account_age',
    xaxis_title_text='account_age',
    yaxis_title_text='frequency',
    bargap=0.2,
    width=1100, height=350,
    legend={"title": ""},
    xaxis=dict(showgrid=True, dtick=500, range=[0, max(train_users_data['account_age']) + 250]),
    xaxis2=dict(showgrid=True, dtick=500, range=[0, max(train_users_data['account_age']) + 250]),
    yaxis=dict(showgrid=True))
fig.show()
[58]:
len(train_users_data)
[58]:
7000
Filter to have the same number of records for each class - part II¶
[59]:
train_users_data = filter_df_for_balanced_classes(train_users_data, bot_label_value=1, human_label_value=0)
val_users_data = filter_df_for_balanced_classes(val_users_data, bot_label_value=1, human_label_value=0)
test_users_data = filter_df_for_balanced_classes(test_users_data, bot_label_value=1, human_label_value=0)
Number of bots: 3473
Number of human users: 3473
Number of bots: 743
Number of human users: 743
Number of bots: 730
Number of human users: 730
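`filter_df_for_balanced_classes` is defined earlier in the notebook; judging from the printed counts (3473/3473, 743/743, 730/730), it downsamples the majority class so both labels are equally represented. A hedged re-sketch of that behavior (this reimplementation is an assumption, not the notebook's exact code):

```python
import pandas as pd

def filter_df_for_balanced_classes(df, bot_label_value=1, human_label_value=0):
    # Keep min(count_bots, count_humans) rows of each class (assumed behavior)
    bots = df[df['label'] == bot_label_value]
    humans = df[df['label'] == human_label_value]
    n = min(len(bots), len(humans))
    print("Number of bots:", n)
    print("Number of human users:", n)
    return pd.concat([bots.head(n), humans.head(n)])

toy = pd.DataFrame({'label': [1] * 5 + [0] * 3})
balanced = filter_df_for_balanced_classes(toy)  # 5 bots downsampled to 3
```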
Drop columns that contain the same value in every row of the training dataset¶
[60]:
same_data_columns = list(train_users_data.columns[train_users_data.apply(lambda x: x.nunique()) == 1])
same_data_columns
[60]:
[]
[61]:
train_users_data = train_users_data.drop(same_data_columns, axis=1)
val_users_data = val_users_data.drop(same_data_columns, axis=1)
test_users_data = test_users_data.drop(same_data_columns, axis=1)
Standardize data using the mean and standard deviation of the training set¶
[62]:
def standardize_column(df, col_name, mean_training, std_training):
    df_cp = df.copy()
    df_cp[col_name] = (df[col_name] - mean_training) / std_training
    return df_cp
[63]:
columns_to_standardize = list(train_users_data.columns)
columns_to_standardize.remove('label')
columns_to_standardize.remove('id')
[64]:
for column_name in columns_to_standardize:
    mean_training = train_users_data[column_name].mean()
    std_training = train_users_data[column_name].std()
    print(column_name)
    print("mean_training = ", mean_training)
    print("std_training = ", std_training)
    print()
    train_users_data = standardize_column(train_users_data, column_name, mean_training, std_training)
    val_users_data = standardize_column(val_users_data, column_name, mean_training, std_training)
    test_users_data = standardize_column(test_users_data, column_name, mean_training, std_training)
followers_count
mean_training = 6208.796717535272
std_training = 44145.842669630874
tweet_count
mean_training = 6595.194212496401
std_training = 33286.69197462025
following_count
mean_training = 1259.0381514540743
std_training = 6144.786564521689
account_age
mean_training = 2443.6521739130435
std_training = 1640.571178505393
descr_len
mean_training = 84.77094730780306
std_training = 59.6092442340293
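Because the mean and standard deviation come from the training split only, validation and test data are transformed without leaking their own statistics. A minimal check of `standardize_column` (the toy frames are illustrative):

```python
import pandas as pd

def standardize_column(df, col_name, mean_training, std_training):
    # Z-score the column using statistics computed on the training split only
    df_cp = df.copy()
    df_cp[col_name] = (df[col_name] - mean_training) / std_training
    return df_cp

train = pd.DataFrame({'x': [0.0, 10.0, 20.0]})
val = pd.DataFrame({'x': [10.0, 30.0]})
m, s = train['x'].mean(), train['x'].std()   # 10.0 and 10.0 (sample std, ddof=1)
train_std = standardize_column(train, 'x', m, s)
val_std = standardize_column(val, 'x', m, s)
print(list(train_std['x']), list(val_std['x']))  # → [-1.0, 0.0, 1.0] [0.0, 2.0]
```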
Correlation¶
[65]:
sns.set(font_scale=2)
[66]:
corr_threshold = 0.52
corr = train_users_data.drop(['id'], axis=1).corr()
lower_tri = corr.where(np.tril(np.ones(corr.shape), k=-1).astype(bool))  # lower triangular correlation matrix
f = plt.figure(figsize=(20, 15))
sns.heatmap(lower_tri, cmap="PiYG", annot=True, vmin=-1, vmax=1, ax=plt.gca())
high_corr = []
for column in train_users_data:
    if (column != 'id'):
        for col in train_users_data:
            if (col != 'id'):
                if abs(lower_tri[column][col]) > corr_threshold:
                    high_corr.append((column, col, lower_tri[column][col]))
high_corr = sorted(high_corr, key=lambda x: x[2], reverse=True)
[67]:
sns.set(font_scale=1)
[68]:
print("Number of columns containing high correlation:", len(set([x[0] for x in high_corr])))
high_corr
Number of columns containing high correlation: 0
[68]:
[]
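The empty result above means no feature pair crossed the 0.52 threshold. The lower-triangular masking idea itself can be illustrated on a toy frame where correlations do exceed it (the data and threshold here are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [4, 3, 1, 2]})
corr = df.corr()
# Strict lower triangle (k=-1) so each pair is reported only once
lower_tri = corr.where(np.tril(np.ones(corr.shape), k=-1).astype(bool))
pairs = [(row, col, lower_tri.loc[row, col])
         for row in corr.index for col in corr.columns
         if pd.notna(lower_tri.loc[row, col]) and abs(lower_tri.loc[row, col]) > 0.52]
print(pairs)  # ('b','a') is perfectly correlated; 'c' anti-correlates with both
```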
[69]:
# train_users_data = train_users_data.drop(['listed_count'], axis=1)
# val_users_data = val_users_data.drop(['listed_count'], axis=1)
# test_users_data = test_users_data.drop(['listed_count'], axis=1)
# train_users_data = train_users_data.drop(['has_description'], axis=1)
# val_users_data = val_users_data.drop(['has_description'], axis=1)
# test_users_data = test_users_data.drop(['has_description'], axis=1)
[70]:
train_users_data
[70]:
| | label | id | followers_count | tweet_count | following_count | account_age | descr_len |
|---|---|---|---|---|---|---|---|
| 6625 | 0.0 | 1214018601683836928 | -0.128094 | -0.192545 | -0.113436 | -1.001878 | 1.043950 |
| 2489 | 0.0 | 109927809 | -0.078032 | -0.044047 | 0.038726 | 1.209547 | 1.111053 |
| 9919 | 0.0 | 2325624539 | -0.092643 | -0.120715 | 0.608965 | 0.315346 | 1.262037 |
| 6964 | 1.0 | 1362188147250061315 | -0.140643 | -0.198103 | -0.196107 | -1.250572 | -1.422111 |
| 3467 | 0.0 | 1105810614935531521 | -0.133711 | -0.059549 | -0.178369 | -0.819624 | 0.993622 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 3325 | 0.0 | 1059189764 | -0.128705 | -0.129968 | 0.053210 | 0.557335 | 1.262037 |
| 1881 | 0.0 | 235022253 | -0.116065 | -0.046150 | 0.438414 | 1.001083 | 1.060726 |
| 4861 | 0.0 | 3053537383 | -0.140371 | -0.198013 | -0.189598 | 0.078234 | -0.616867 |
| 1175 | 0.0 | 4348813577 | -0.139805 | -0.197232 | -0.155423 | -0.090000 | -1.422111 |
| 8447 | 0.0 | 912359571947085824 | 3.903521 | -0.173889 | -0.135731 | -0.494128 | 0.909742 |
6946 rows × 7 columns
Split users data for input and output¶
[71]:
train_users_data_X = train_users_data.drop(['label'], axis=1)
train_users_data_Y = pd.concat([train_users_data['label']], axis=1)
val_users_data_X = val_users_data.drop(['label'], axis=1)
val_users_data_Y = pd.concat([val_users_data['label']], axis=1)
test_users_data_X = test_users_data.drop(['label'], axis=1)
test_users_data_Y = pd.concat([test_users_data['label']], axis=1)
Tweets data¶
Loading data¶
Load users data to retrieve the label to attach to each tweet¶
[72]:
dataset_name = "twitbot_22_preprocessed_common_users_ids"
users_table_name = "users"
BQ_TABLE_USERS = dataset_name + "." + users_table_name
users_table_id = project_id + "." + BQ_TABLE_USERS
[73]:
SQL_QUERY = f"""WITH human_records AS (
    SELECT *, ROW_NUMBER() OVER () row_num
    FROM {BQ_TABLE_USERS}
    WHERE label = 'human'
    LIMIT 5000),
bot_records AS (
    SELECT *, ROW_NUMBER() OVER () row_num
    FROM {BQ_TABLE_USERS}
    WHERE label = 'bot'
    LIMIT 5000)
SELECT * FROM human_records
UNION ALL
SELECT * FROM bot_records
ORDER BY row_num;"""
users_df1 = bqclient.query(SQL_QUERY).to_dataframe()
users_df1 = users_df1.drop(['row_num'], axis=1)
Load tweets data¶
[74]:
dataset_name = "twitbot_22_preprocessed_common_users_ids"
tweets_table_name = "tweets"
BQ_TABLE_TWEETS = dataset_name + "." + tweets_table_name
tweets_table_id = project_id + "." + BQ_TABLE_TWEETS
[75]:
# comma-separated string of user IDs from users dataframe
users_df0 = pd.DataFrame(users_df1).copy()
users_df0['id'] = users_df0['id'].astype(str)
user_ids = users_df0['id'].to_list()
# SQL query to select records from the 'tweets' table
SQL_QUERY = f"""SELECT * FROM {BQ_TABLE_TWEETS} WHERE CAST(author_id AS STRING) IN ({str(user_ids)[1:-1]})"""
tweets_df1 = bqclient.query(SQL_QUERY).to_dataframe()
[76]:
# LIMIT RESULTS OPTIONS
pd.set_option('display.max_rows', 100)
# pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
[77]:
len(tweets_df1)
[77]:
426163
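The tweets query above builds its `IN (...)` list by slicing the brackets off the Python list repr. A small illustration of the string that produces (the table name here is a stand-in; for untrusted input, BigQuery query parameters would be the safer choice over string interpolation):

```python
user_ids = ['50338306', '106526969']
in_clause = str(user_ids)[1:-1]   # drops the surrounding [ and ]
sql = f"SELECT * FROM tweets WHERE CAST(author_id AS STRING) IN ({in_clause})"
print(sql)
# → SELECT * FROM tweets WHERE CAST(author_id AS STRING) IN ('50338306', '106526969')
```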
Append the author label (1/0, bot/human) to the tweets dataset¶
[78]:
user_id_label_dict = users_df1.set_index('id')['label'].to_dict()
tweets_df1['author_label'] = tweets_df1['author_id'].map(user_id_label_dict)
[79]:
# tweets_df1
[80]:
org_tweet_df = pd.DataFrame(tweets_df1).copy()
tweets_df = pd.DataFrame(org_tweet_df).copy()
[81]:
tweets_df.columns
[81]:
Index(['id', 'author_id', 'created_at', 'org_text', 'text', 'source',
'withheld', 'copyright_infringement', 'is_reply', 'geo_tagged',
'latitude', 'longitude', 'conversation_id', 'reply_settings',
'retweet_count', 'reply_count', 'like_count', 'quote_count',
'any_polls_attached', 'any_media_attached', 'possibly_sensitive',
'has_referenced_tweets', 'media_attached', 'no_cashtags', 'no_mentions',
'no_user_mentions', 'user_mentions', 'no_urls', 'contains_images',
'contains_annotations', 'no_hashtags', 'hashtags',
'context_annotations_domain_id', 'context_annotations_domain_name',
'context_annotations_entity_id', 'context_annotations_entity_name',
'author_label'],
dtype='object')
[82]:
len(tweets_df)
[82]:
426163
[83]:
tweets_df
[83]:
| | id | author_id | created_at | org_text | text | source | withheld | copyright_infringement | is_reply | geo_tagged | latitude | longitude | conversation_id | reply_settings | retweet_count | reply_count | like_count | quote_count | any_polls_attached | any_media_attached | possibly_sensitive | has_referenced_tweets | media_attached | no_cashtags | no_mentions | no_user_mentions | user_mentions | no_urls | contains_images | contains_annotations | no_hashtags | hashtags | context_annotations_domain_id | context_annotations_domain_name | context_annotations_entity_id | context_annotations_entity_name | author_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | t1485719054551855120 | 50338306 | 1643058000 | North Carolina Governor Roy Cooper on Friday sought federal help in the Charlotte area as hospitals across the state face record numbers of patients hospitalized with COVID-19. https://t.co/1CZNI3pPza | north carolina governor roy cooper on friday sought federal help in the charlotte area as hospitals across the state face record numbers of patients hospitalized with covid | Sked Social | False | False | False | False | NaN | NaN | 1485719054551855120 | None | 0 | 0 | 0 | 0 | False | False | False | False | False | 0 | 0 | 0 | [] | 1 | False | False | 0 | [] | <NA> | None | <NA> | None | human |
| 1 | t1466853630691377152 | 50338306 | 1638560133 | On this visit to St. Louis, I wanted to focus on Union Station, a magnificent property located in the heart of the city which features an aquarium, numerous dining options, a Ferris wheel and, of course, the Union Station Hotel. https://t.co/HlDeMqA2n8 | on this visit to st louis i wanted to focus on union station a magnificent property located in the heart of the city which features an aquarium numerous dining options a ferris wheel and of course the union station hotel | Sked Social | False | False | False | False | NaN | NaN | 1466853630691377152 | None | 0 | 0 | 0 | 0 | False | False | False | False | False | 0 | 0 | 0 | [] | 1 | False | False | 0 | [] | <NA> | None | <NA> | None | human |
| 2 | t1446494364792987682 | 50338306 | 1633706105 | Alice Dunbar-Nelson was a racially-mixed bisexual poet and author whose career spanned multiple literary genres and culminated during the Harlem Renaissance. She was also a lifelong educator and activist who fought for women’s suffrage and equality https://t.co/WqsxXTZKqM https://t.co/XQevKfSRyU | alice dunbarnelson was a raciallymixed bisexual poet and author whose career spanned multiple literary genres and culminated during the harlem renaissance she was also a lifelong educator and activist who fought for womens suffrage and equality | Sked Social | False | False | False | False | NaN | NaN | 1446494364792987682 | None | 0 | 0 | 0 | 0 | False | False | False | False | False | 0 | 0 | 0 | [] | 1 | False | False | 0 | [] | <NA> | None | <NA> | None | human |
| 3 | t1486413634813276165 | 50338306 | 1643223601 | The new year is finally here. We are headed into the third year of a devastating pandemic in a nation that seems more divided than ever. https://t.co/EiWrMvSCPA | the new year is finally here we are headed into the third year of a devastating pandemic in a nation that seems more divided than ever | Sked Social | False | False | False | False | NaN | NaN | 1486413634813276165 | None | 0 | 0 | 0 | 0 | False | False | False | False | False | 0 | 0 | 0 | [] | 1 | False | False | 0 | [] | <NA> | None | <NA> | None | human |
| 4 | t1471189609078050821 | 106526969 | 1639593910 | The warning letter highlights shortfalls in risk assessment, corrective and preventive action, complaint handling, device recalls and adverse event reporting at the Northridge, California, headquarters of Medtronic's diabetes segment. https://t.co/fMRpEWJPWC | the warning letter highlights shortfalls in risk assessment corrective and preventive action complaint handling device recalls and adverse event reporting at the northridge california headquarters of medtronics diabetes segment | Twitter Web App | False | False | False | False | NaN | NaN | 1471189609078050821 | None | 1 | 0 | 0 | 0 | False | False | False | False | False | 0 | 0 | 0 | [] | 1 | False | False | 0 | [] | <NA> | None | <NA> | None | human |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 426158 | t1486308593049780232 | 1442657001440489480 | 1643198557 | What if you got enough of the small points, that you didn't have to ace the midterm or the final. What's your secret studying hack?\n#beginagainandwin #thinkdifferent #motivation #help #passtheclass #learn #theblackmancan #B1 #youareenough #positivity #timemanagement #repetition�� https://t.co/EnTeJyo7jm | what if you got enough of the small points that you didnt have to ace the midterm or the final whats your secret studying hack beginagainandwin thinkdifferent motivation help passtheclass learn theblackmancan b youareenough positivity timemanagement repetition | PromoRepublic | False | False | False | False | NaN | NaN | 1486308593049780232 | None | 1 | 0 | 1 | 0 | False | False | False | False | False | 0 | 0 | 0 | [] | 0 | False | False | 12 | [{'tagname': 'beginagainandwin'}, {'tagname': 'thinkdifferent'}, {'tagname': 'motivation'}, {'tagname': 'help'}, {'tagname': 'passtheclass'}, {'tagname': 'learn'}, {'tagname': 'theblackmancan'}, {'tagname': 'B1'}, {'tagname': 'youareenough'}, {'tagname': 'positivity'}, {'tagname': 'timemanagement'}, {'tagname': 'repetition'}] | <NA> | None | <NA> | None | bot |
| 426159 | t1495030852715118594 | 1447249447424036866 | 1645278106 | SCOPE OF CLOUD STORAGE \n\n#Mcoin aims to disrupt the cloud storage industry by eliminating the current industry barriers. M-coin network is building the most reliable #decentralized cloud data storage that ensures low costs and immutable security for the user’s data.\n\n#crypto https://t.co/XFshg0SVeY | scope of cloud storage mcoin aims to disrupt the cloud storage industry by eliminating the current industry barriers mcoin network is building the most reliable decentralized cloud data storage that ensures low costs and immutable security for the users data crypto | Twitter Web App | False | False | False | False | NaN | NaN | 1495030852715118594 | None | 1 | 0 | 1 | 0 | False | False | False | False | False | 0 | 0 | 0 | [] | 0 | False | False | 3 | [{'tagname': 'Mcoin'}, {'tagname': 'decentralized'}, {'tagname': 'crypto'}] | <NA> | None | <NA> | None | bot |
| 426160 | t1495671246339850240 | 1468895392699846656 | 1645430788 | Everything in life is somewhere else, and you get there in a car.\n“The road goes on forever and the party never ends.”\nLive your life by a Compass not a clock\n#Rentalcars #Cab #self #driving #longdrive #safe #mydriver #\n��https://t.co/qYwHNQ8eQI\n��8886399949 https://t.co/dryhxD1XJI | everything in life is somewhere else and you get there in a car the road goes on forever and the party never ends live your life by a compass not a clock rentalcars cab self driving longdrive safe mydriver | Twitter Web App | False | False | False | False | NaN | NaN | 1495671246339850240 | None | 0 | 0 | 0 | 0 | False | False | False | False | False | 0 | 0 | 0 | [] | 1 | False | False | 7 | [{'tagname': 'Rentalcars'}, {'tagname': 'Cab'}, {'tagname': 'self'}, {'tagname': 'driving'}, {'tagname': 'longdrive'}, {'tagname': 'safe'}, {'tagname': 'mydriver'}] | <NA> | None | <NA> | None | bot |
| 426161 | t1488487987289731074 | 1469339407345934338 | 1643718165 | Cybersecurity and privacy must be priorities for nonprofits—the risk of ignoring increased cyberthreats is too great. Access this e-book to find out how you can use #Microsoft 365 to build agile security frameworks in your organization. https://t.co/jhMUXYhcNU | cybersecurity and privacy must be priorities for nonprofitsthe risk of ignoring increased cyberthreats is too great access this ebook to find out how you can use microsoft to build agile security frameworks in your organization | ContentMX | False | False | False | False | NaN | NaN | 1488487987289731074 | None | 0 | 0 | 0 | 0 | False | False | False | False | False | 0 | 0 | 0 | [] | 1 | False | False | 1 | [{'tagname': 'Microsoft'}] | <NA> | None | <NA> | None | human |
| 426162 | t1491185930106978304 | 1477407428950036485 | 1644361405 | OK #100Devs I need webcam recommendations. Doesn't have to be cheap, but I don't want to pay $500 either. \n\nRight now I am looking at the Logitech C920x HD Pro Webcam. Open to all suggestions. | ok devs i need webcam recommendations doesnt have to be cheap but i dont want to pay either right now i am looking at the logitech cx hd pro webcam open to all suggestions | TweetDeck | False | False | False | False | NaN | NaN | 1491185930106978304 | None | 0 | 0 | 2 | 0 | False | False | False | False | False | 0 | 0 | 0 | [] | 0 | False | False | 1 | [{'tagname': '100Devs'}] | <NA> | None | <NA> | None | bot |
426163 rows × 37 columns
Data preparation¶
[84]:
def drop_columns(df, columns):
    for column_name in columns:
        curr_df_all_cols = df.columns
        if column_name in curr_df_all_cols:
            df = df.drop([column_name], axis=1)
    return df
[85]:
def encode_not_numeric_columns(df):
    for column_name in df:
        if not is_numeric_dtype(df[column_name]):
            unique_values_dict = dict(enumerate(df[column_name].unique()))
            unique_values_dict = dict((v, k) for k, v in unique_values_dict.items())
            df[column_name] = df[column_name].map(unique_values_dict)
    return df
Null and NaN statistics¶
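As an aside, the `encode_not_numeric_columns` helper defined above maps each non-numeric column's unique values to integer codes in order of first appearance, leaving numeric columns untouched. A self-contained check on a toy frame (the function is repeated so the snippet runs on its own):

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def encode_not_numeric_columns(df):
    # Map each non-numeric column's unique values to integer codes (order of first appearance)
    for column_name in df:
        if not is_numeric_dtype(df[column_name]):
            unique_values_dict = dict(enumerate(df[column_name].unique()))
            unique_values_dict = dict((v, k) for k, v in unique_values_dict.items())
            df[column_name] = df[column_name].map(unique_values_dict)
    return df

toy = pd.DataFrame({'source': ['Web', 'App', 'Web'], 'n': [1, 2, 3]})
encoded = encode_not_numeric_columns(toy)
print(list(encoded['source']))  # → [0, 1, 0]
```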
[86]:
for col_name in tweets_df:
    count1 = pd.isnull(tweets_df[col_name]).sum()
    print(col_name + ": " + str(count1))
id: 0
author_id: 0
created_at: 0
org_text: 0
text: 0
source: 0
withheld: 0
copyright_infringement: 0
is_reply: 0
geo_tagged: 0
latitude: 422884
longitude: 422884
conversation_id: 0
reply_settings: 420428
retweet_count: 0
reply_count: 0
like_count: 0
quote_count: 0
any_polls_attached: 0
any_media_attached: 0
possibly_sensitive: 0
has_referenced_tweets: 0
media_attached: 0
no_cashtags: 0
no_mentions: 0
no_user_mentions: 0
user_mentions: 0
no_urls: 0
contains_images: 0
contains_annotations: 0
no_hashtags: 0
hashtags: 0
context_annotations_domain_id: 426163
context_annotations_domain_name: 426163
context_annotations_entity_id: 426163
context_annotations_entity_name: 426163
author_label: 0
reply_settings¶
The Twitter documentation for this field mentions that if the field isn't specified, it defaults to everyone.
[88]:
set(tweets_df.loc[tweets_df['reply_settings'].notna()]['reply_settings'])
[88]:
{'everyone', 'following', 'mentionedUsers'}
[89]:
set(tweets_df['reply_settings'])
[89]:
{None, 'everyone', 'following', 'mentionedUsers'}
Replace the unspecified value (None) with 'everyone'¶
[90]:
tweets_df['reply_settings'].fillna('everyone', inplace=True)
Remove columns with the most missing values¶
[91]:
most_nan_columns = ['context_annotations_domain_id', 'context_annotations_domain_name', 'context_annotations_entity_id', 'context_annotations_entity_name', 'latitude', 'longitude']
tweets_df = drop_columns(tweets_df, most_nan_columns)
Encoding of non-numeric information that will be used by the model¶
Encode boolean columns¶
[92]:
boolean_columns = ['withheld', 'copyright_infringement', 'is_reply', 'geo_tagged', 'any_polls_attached', 'any_media_attached', 'possibly_sensitive', 'has_referenced_tweets', 'media_attached', 'contains_images', 'contains_annotations']
[93]:
# Remap the values of the dataframe
for col_name in boolean_columns:
    tweets_df[col_name] = tweets_df[col_name].map({True: 1, False: 0})

# Remap label values human/bot to 0/1
label_col = "author_label"
tweets_df[label_col] = tweets_df[label_col].map({"human": 0, "bot": 1})
Encode reply_settings categorical column¶
[94]:
reply_settings_dict = {'everyone': 0, 'following': 1, 'mentionedUsers': 2}
[95]:
tweets_df['reply_settings'] = tweets_df['reply_settings'].map(reply_settings_dict)
[96]:
set(tweets_df['reply_settings'])
[96]:
{0, 1, 2}
[97]:
tweets_df
[97]:
| | id | author_id | created_at | org_text | text | source | withheld | copyright_infringement | is_reply | geo_tagged | conversation_id | reply_settings | retweet_count | reply_count | like_count | quote_count | any_polls_attached | any_media_attached | possibly_sensitive | has_referenced_tweets | media_attached | no_cashtags | no_mentions | no_user_mentions | user_mentions | no_urls | contains_images | contains_annotations | no_hashtags | hashtags | author_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | t1485719054551855120 | 50338306 | 1643058000 | North Carolina Governor Roy Cooper on Friday sought federal help in the Charlotte area as hospitals across the state face record numbers of patients hospitalized with COVID-19. https://t.co/1CZNI3pPza | north carolina governor roy cooper on friday sought federal help in the charlotte area as hospitals across the state face record numbers of patients hospitalized with covid | Sked Social | 0 | 0 | 0 | 0 | 1485719054551855120 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | [] | 1 | 0 | 0 | 0 | [] | 0 |
| 1 | t1466853630691377152 | 50338306 | 1638560133 | On this visit to St. Louis, I wanted to focus on Union Station, a magnificent property located in the heart of the city which features an aquarium, numerous dining options, a Ferris wheel and, of course, the Union Station Hotel. https://t.co/HlDeMqA2n8 | on this visit to st louis i wanted to focus on union station a magnificent property located in the heart of the city which features an aquarium numerous dining options a ferris wheel and of course the union station hotel | Sked Social | 0 | 0 | 0 | 0 | 1466853630691377152 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | [] | 1 | 0 | 0 | 0 | [] | 0 |
| 2 | t1446494364792987682 | 50338306 | 1633706105 | Alice Dunbar-Nelson was a racially-mixed bisexual poet and author whose career spanned multiple literary genres and culminated during the Harlem Renaissance. She was also a lifelong educator and activist who fought for women’s suffrage and equality https://t.co/WqsxXTZKqM https://t.co/XQevKfSRyU | alice dunbarnelson was a raciallymixed bisexual poet and author whose career spanned multiple literary genres and culminated during the harlem renaissance she was also a lifelong educator and activist who fought for womens suffrage and equality | Sked Social | 0 | 0 | 0 | 0 | 1446494364792987682 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | [] | 1 | 0 | 0 | 0 | [] | 0 |
| 3 | t1486413634813276165 | 50338306 | 1643223601 | The new year is finally here. We are headed into the third year of a devastating pandemic in a nation that seems more divided than ever. https://t.co/EiWrMvSCPA | the new year is finally here we are headed into the third year of a devastating pandemic in a nation that seems more divided than ever | Sked Social | 0 | 0 | 0 | 0 | 1486413634813276165 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | [] | 1 | 0 | 0 | 0 | [] | 0 |
| 4 | t1471189609078050821 | 106526969 | 1639593910 | The warning letter highlights shortfalls in risk assessment, corrective and preventive action, complaint handling, device recalls and adverse event reporting at the Northridge, California, headquarters of Medtronic's diabetes segment. https://t.co/fMRpEWJPWC | the warning letter highlights shortfalls in risk assessment corrective and preventive action complaint handling device recalls and adverse event reporting at the northridge california headquarters of medtronics diabetes segment | Twitter Web App | 0 | 0 | 0 | 0 | 1471189609078050821 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | [] | 1 | 0 | 0 | 0 | [] | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 426158 | t1486308593049780232 | 1442657001440489480 | 1643198557 | What if you got enough of the small points, that you didn't have to ace the midterm or the final. What's your secret studying hack?\n#beginagainandwin #thinkdifferent #motivation #help #passtheclass #learn #theblackmancan #B1 #youareenough #positivity #timemanagement #repetition�� https://t.co/EnTeJyo7jm | what if you got enough of the small points that you didnt have to ace the midterm or the final whats your secret studying hack beginagainandwin thinkdifferent motivation help passtheclass learn theblackmancan b youareenough positivity timemanagement repetition | PromoRepublic | 0 | 0 | 0 | 0 | 1486308593049780232 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | [] | 0 | 0 | 0 | 12 | [{'tagname': 'beginagainandwin'}, {'tagname': 'thinkdifferent'}, {'tagname': 'motivation'}, {'tagname': 'help'}, {'tagname': 'passtheclass'}, {'tagname': 'learn'}, {'tagname': 'theblackmancan'}, {'tagname': 'B1'}, {'tagname': 'youareenough'}, {'tagname': 'positivity'}, {'tagname': 'timemanagement'}, {'tagname': 'repetition'}] | 1 |
| 426159 | t1495030852715118594 | 1447249447424036866 | 1645278106 | SCOPE OF CLOUD STORAGE \n\n#Mcoin aims to disrupt the cloud storage industry by eliminating the current industry barriers. M-coin network is building the most reliable #decentralized cloud data storage that ensures low costs and immutable security for the user’s data.\n\n#crypto https://t.co/XFshg0SVeY | scope of cloud storage mcoin aims to disrupt the cloud storage industry by eliminating the current industry barriers mcoin network is building the most reliable decentralized cloud data storage that ensures low costs and immutable security for the users data crypto | Twitter Web App | 0 | 0 | 0 | 0 | 1495030852715118594 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | [] | 0 | 0 | 0 | 3 | [{'tagname': 'Mcoin'}, {'tagname': 'decentralized'}, {'tagname': 'crypto'}] | 1 |
| 426160 | t1495671246339850240 | 1468895392699846656 | 1645430788 | Everything in life is somewhere else, and you get there in a car.\n“The road goes on forever and the party never ends.”\nLive your life by a Compass not a clock\n#Rentalcars #Cab #self #driving #longdrive #safe #mydriver #\n��https://t.co/qYwHNQ8eQI\n��8886399949 https://t.co/dryhxD1XJI | everything in life is somewhere else and you get there in a car the road goes on forever and the party never ends live your life by a compass not a clock rentalcars cab self driving longdrive safe mydriver | Twitter Web App | 0 | 0 | 0 | 0 | 1495671246339850240 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | [] | 1 | 0 | 0 | 7 | [{'tagname': 'Rentalcars'}, {'tagname': 'Cab'}, {'tagname': 'self'}, {'tagname': 'driving'}, {'tagname': 'longdrive'}, {'tagname': 'safe'}, {'tagname': 'mydriver'}] | 1 |
| 426161 | t1488487987289731074 | 1469339407345934338 | 1643718165 | Cybersecurity and privacy must be priorities for nonprofits—the risk of ignoring increased cyberthreats is too great. Access this e-book to find out how you can use #Microsoft 365 to build agile security frameworks in your organization. https://t.co/jhMUXYhcNU | cybersecurity and privacy must be priorities for nonprofitsthe risk of ignoring increased cyberthreats is too great access this ebook to find out how you can use microsoft to build agile security frameworks in your organization | ContentMX | 0 | 0 | 0 | 0 | 1488487987289731074 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | [] | 1 | 0 | 0 | 1 | [{'tagname': 'Microsoft'}] | 0 |
| 426162 | t1491185930106978304 | 1477407428950036485 | 1644361405 | OK #100Devs I need webcam recommendations. Doesn't have to be cheap, but I don't want to pay $500 either. \n\nRight now I am looking at the Logitech C920x HD Pro Webcam. Open to all suggestions. | ok devs i need webcam recommendations doesnt have to be cheap but i dont want to pay either right now i am looking at the logitech cx hd pro webcam open to all suggestions | TweetDeck | 0 | 0 | 0 | 0 | 1491185930106978304 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | [] | 0 | 0 | 0 | 1 | [{'tagname': '100Devs'}] | 1 |
426163 rows × 31 columns
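The two mapping steps above (booleans to 0/1, `human`/`bot` labels to 0/1) can be sketched on a toy frame; note that `Series.map` returns NaN for any value missing from the dictionary, so an unexpected label would surface as NaN rather than raise:

```python
import pandas as pd

# Toy stand-in for tweets_df (values are made up for illustration)
df = pd.DataFrame({
    "is_reply": [True, False, True],
    "author_label": ["human", "bot", "human"],
})

df["is_reply"] = df["is_reply"].map({True: 1, False: 0})
df["author_label"] = df["author_label"].map({"human": 0, "bot": 1})

print(df["is_reply"].tolist())      # [1, 0, 1]
print(df["author_label"].tolist())  # [0, 1, 0]
```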
Extract some information from the dataframe into new columns

Tweet length
[98]:
tweets_df['cleaned_tweet_len'] = tweets_df['text'].apply(len).astype(float)
tweets_df['org_tweet_len'] = tweets_df['org_text'].apply(len).astype(float)
[99]:
tweets_df
[99]:
| id | author_id | created_at | org_text | text | source | withheld | copyright_infringement | is_reply | geo_tagged | conversation_id | reply_settings | retweet_count | reply_count | like_count | quote_count | any_polls_attached | any_media_attached | possibly_sensitive | has_referenced_tweets | media_attached | no_cashtags | no_mentions | no_user_mentions | user_mentions | no_urls | contains_images | contains_annotations | no_hashtags | hashtags | author_label | cleaned_tweet_len | org_tweet_len | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | t1485719054551855120 | 50338306 | 1643058000 | North Carolina Governor Roy Cooper on Friday sought federal help in the Charlotte area as hospitals across the state face record numbers of patients hospitalized with COVID-19. https://t.co/1CZNI3pPza | north carolina governor roy cooper on friday sought federal help in the charlotte area as hospitals across the state face record numbers of patients hospitalized with covid | Sked Social | 0 | 0 | 0 | 0 | 1485719054551855120 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | [] | 1 | 0 | 0 | 0 | [] | 0 | 172.0 | 200.0 |
| 1 | t1466853630691377152 | 50338306 | 1638560133 | On this visit to St. Louis, I wanted to focus on Union Station, a magnificent property located in the heart of the city which features an aquarium, numerous dining options, a Ferris wheel and, of course, the Union Station Hotel. https://t.co/HlDeMqA2n8 | on this visit to st louis i wanted to focus on union station a magnificent property located in the heart of the city which features an aquarium numerous dining options a ferris wheel and of course the union station hotel | Sked Social | 0 | 0 | 0 | 0 | 1466853630691377152 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | [] | 1 | 0 | 0 | 0 | [] | 0 | 220.0 | 253.0 |
| 2 | t1446494364792987682 | 50338306 | 1633706105 | Alice Dunbar-Nelson was a racially-mixed bisexual poet and author whose career spanned multiple literary genres and culminated during the Harlem Renaissance. She was also a lifelong educator and activist who fought for women’s suffrage and equality https://t.co/WqsxXTZKqM https://t.co/XQevKfSRyU | alice dunbarnelson was a raciallymixed bisexual poet and author whose career spanned multiple literary genres and culminated during the harlem renaissance she was also a lifelong educator and activist who fought for womens suffrage and equality | Sked Social | 0 | 0 | 0 | 0 | 1446494364792987682 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | [] | 1 | 0 | 0 | 0 | [] | 0 | 244.0 | 297.0 |
| 3 | t1486413634813276165 | 50338306 | 1643223601 | The new year is finally here. We are headed into the third year of a devastating pandemic in a nation that seems more divided than ever. https://t.co/EiWrMvSCPA | the new year is finally here we are headed into the third year of a devastating pandemic in a nation that seems more divided than ever | Sked Social | 0 | 0 | 0 | 0 | 1486413634813276165 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | [] | 1 | 0 | 0 | 0 | [] | 0 | 134.0 | 160.0 |
| 4 | t1471189609078050821 | 106526969 | 1639593910 | The warning letter highlights shortfalls in risk assessment, corrective and preventive action, complaint handling, device recalls and adverse event reporting at the Northridge, California, headquarters of Medtronic's diabetes segment. https://t.co/fMRpEWJPWC | the warning letter highlights shortfalls in risk assessment corrective and preventive action complaint handling device recalls and adverse event reporting at the northridge california headquarters of medtronics diabetes segment | Twitter Web App | 0 | 0 | 0 | 0 | 1471189609078050821 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | [] | 1 | 0 | 0 | 0 | [] | 0 | 227.0 | 258.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 426158 | t1486308593049780232 | 1442657001440489480 | 1643198557 | What if you got enough of the small points, that you didn't have to ace the midterm or the final. What's your secret studying hack?\n#beginagainandwin #thinkdifferent #motivation #help #passtheclass #learn #theblackmancan #B1 #youareenough #positivity #timemanagement #repetition�� https://t.co/EnTeJyo7jm | what if you got enough of the small points that you didnt have to ace the midterm or the final whats your secret studying hack beginagainandwin thinkdifferent motivation help passtheclass learn theblackmancan b youareenough positivity timemanagement repetition | PromoRepublic | 0 | 0 | 0 | 0 | 1486308593049780232 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | [] | 0 | 0 | 0 | 12 | [{'tagname': 'beginagainandwin'}, {'tagname': 'thinkdifferent'}, {'tagname': 'motivation'}, {'tagname': 'help'}, {'tagname': 'passtheclass'}, {'tagname': 'learn'}, {'tagname': 'theblackmancan'}, {'tagname': 'B1'}, {'tagname': 'youareenough'}, {'tagname': 'positivity'}, {'tagname': 'timemanagement'}, {'tagname': 'repetition'}] | 1 | 260.0 | 304.0 |
| 426159 | t1495030852715118594 | 1447249447424036866 | 1645278106 | SCOPE OF CLOUD STORAGE \n\n#Mcoin aims to disrupt the cloud storage industry by eliminating the current industry barriers. M-coin network is building the most reliable #decentralized cloud data storage that ensures low costs and immutable security for the user’s data.\n\n#crypto https://t.co/XFshg0SVeY | scope of cloud storage mcoin aims to disrupt the cloud storage industry by eliminating the current industry barriers mcoin network is building the most reliable decentralized cloud data storage that ensures low costs and immutable security for the users data crypto | Twitter Web App | 0 | 0 | 0 | 0 | 1495030852715118594 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | [] | 0 | 0 | 0 | 3 | [{'tagname': 'Mcoin'}, {'tagname': 'decentralized'}, {'tagname': 'crypto'}] | 1 | 265.0 | 299.0 |
| 426160 | t1495671246339850240 | 1468895392699846656 | 1645430788 | Everything in life is somewhere else, and you get there in a car.\n“The road goes on forever and the party never ends.”\nLive your life by a Compass not a clock\n#Rentalcars #Cab #self #driving #longdrive #safe #mydriver #\n��https://t.co/qYwHNQ8eQI\n��8886399949 https://t.co/dryhxD1XJI | everything in life is somewhere else and you get there in a car the road goes on forever and the party never ends live your life by a compass not a clock rentalcars cab self driving longdrive safe mydriver | Twitter Web App | 0 | 0 | 0 | 0 | 1495671246339850240 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | [] | 1 | 0 | 0 | 7 | [{'tagname': 'Rentalcars'}, {'tagname': 'Cab'}, {'tagname': 'self'}, {'tagname': 'driving'}, {'tagname': 'longdrive'}, {'tagname': 'safe'}, {'tagname': 'mydriver'}] | 1 | 205.0 | 282.0 |
| 426161 | t1488487987289731074 | 1469339407345934338 | 1643718165 | Cybersecurity and privacy must be priorities for nonprofits—the risk of ignoring increased cyberthreats is too great. Access this e-book to find out how you can use #Microsoft 365 to build agile security frameworks in your organization. https://t.co/jhMUXYhcNU | cybersecurity and privacy must be priorities for nonprofitsthe risk of ignoring increased cyberthreats is too great access this ebook to find out how you can use microsoft to build agile security frameworks in your organization | ContentMX | 0 | 0 | 0 | 0 | 1488487987289731074 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | [] | 1 | 0 | 0 | 1 | [{'tagname': 'Microsoft'}] | 0 | 227.0 | 260.0 |
| 426162 | t1491185930106978304 | 1477407428950036485 | 1644361405 | OK #100Devs I need webcam recommendations. Doesn't have to be cheap, but I don't want to pay $500 either. \n\nRight now I am looking at the Logitech C920x HD Pro Webcam. Open to all suggestions. | ok devs i need webcam recommendations doesnt have to be cheap but i dont want to pay either right now i am looking at the logitech cx hd pro webcam open to all suggestions | TweetDeck | 0 | 0 | 0 | 0 | 1491185930106978304 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | [] | 0 | 0 | 0 | 1 | [{'tagname': '100Devs'}] | 1 | 171.0 | 195.0 |
426163 rows × 33 columns
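The length features added above could equivalently be computed with the vectorized `.str.len()` accessor; a small check (toy strings) that both routes agree:

```python
import pandas as pd

texts = pd.Series(["north carolina governor", "ok devs"])

via_apply = texts.apply(len).astype(float)  # the notebook's approach
via_str = texts.str.len().astype(float)     # vectorized equivalent

print(via_apply.tolist())  # [23.0, 7.0]
```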
Time of day (in minutes, UTC)
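The conversion implemented below maps a Unix timestamp to minutes past UTC midnight (hour * 60 + minute); a quick sketch using one of the `created_at` values shown above:

```python
from datetime import datetime

def get_time_in_minutes(unix_time):
    # Convert a Unix timestamp to minutes past midnight, UTC
    dt = datetime.utcfromtimestamp(unix_time)
    return dt.hour * 60 + dt.minute

# 1643058000 is 2022-01-24 21:00:00 UTC
print(get_time_in_minutes(1643058000))  # 1260
```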
[100]:
def convert_unixtime_to_datetime(a):
    return datetime.utcfromtimestamp(a)

def get_time_in_minutes(unix_time):
    h = convert_unixtime_to_datetime(unix_time).hour
    m = convert_unixtime_to_datetime(unix_time).minute
    all_minutes = (h * 60) + m
    return all_minutes

[101]:
tweets_df['time_of_creation'] = tweets_df.apply(lambda x: get_time_in_minutes(x.created_at), axis=1).astype(float)

Add the day gap between consecutive tweets: days_since_prev_tweet
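The gap feature built in the next cells (sort per author, `groupby(...).diff()`, fill each author's first tweet with 0) can be sketched on toy data:

```python
import pandas as pd

# Toy data: two authors with made-up tweet dates
df = pd.DataFrame({
    "author_id": ["a", "a", "b", "a"],
    "creation_date": pd.to_datetime(["2022-01-01", "2022-01-04", "2022-01-02", "2022-01-09"]),
})
df = df.sort_values(["author_id", "creation_date"])

# diff() within each author; the first tweet per author yields NaT -> NaN days
df["days_since_prev_tweet"] = df.groupby("author_id")["creation_date"].diff().dt.days
df["days_since_prev_tweet"] = df["days_since_prev_tweet"].fillna(0)

print(df["days_since_prev_tweet"].tolist())  # [0.0, 3.0, 5.0, 0.0]
```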
[102]:
tweets_df['creation_date'] = tweets_df.apply(lambda x: convert_unixtime_to_datetime(x.created_at), axis=1)
tweets_df.sort_values(by=['author_id', 'creation_date'], inplace=True)
[103]:
grouped = tweets_df.groupby('author_id')
tweets_df['days_since_prev_tweet'] = grouped['creation_date'].diff().dt.days
tweets_df['days_since_prev_tweet'].fillna(0, inplace=True)
tweets_df = tweets_df.drop(['creation_date'], axis=1)
[104]:
# Revert to the original order
tweets_df.sort_index(inplace=True)

Remove some leftover special characters
[105]:
tweets_df['text'] = tweets_df['text'].str.replace('|', '', regex=False)
[106]:
tweets_df.columns
[106]:
Index(['id', 'author_id', 'created_at', 'org_text', 'text', 'source',
'withheld', 'copyright_infringement', 'is_reply', 'geo_tagged',
'conversation_id', 'reply_settings', 'retweet_count', 'reply_count',
'like_count', 'quote_count', 'any_polls_attached', 'any_media_attached',
'possibly_sensitive', 'has_referenced_tweets', 'media_attached',
'no_cashtags', 'no_mentions', 'no_user_mentions', 'user_mentions',
'no_urls', 'contains_images', 'contains_annotations', 'no_hashtags',
'hashtags', 'author_label', 'cleaned_tweet_len', 'org_tweet_len',
'time_of_creation', 'days_since_prev_tweet'],
       dtype='object')
[107]:
add_tweets_feature_shp_values = ['is_reply', 'time_of_creation', 'no_urls', 'no_hashtags', 'org_tweet_len', 'no_mentions', 'any_media_attached', 'contains_annotations', 'has_referenced_tweets', 'possibly_sensitive', 'no_user_mentions']
[108]:
col_to_leave = ['id', 'author_id', 'text', 'days_since_prev_tweet', 'created_at'] + add_tweets_feature_shp_values
tweets_df = tweets_df[col_to_leave]
[109]:
tweets_df.columns
[109]:
Index(['id', 'author_id', 'text', 'days_since_prev_tweet', 'created_at',
'is_reply', 'time_of_creation', 'no_urls', 'no_hashtags',
'org_tweet_len', 'no_mentions', 'any_media_attached',
'contains_annotations', 'has_referenced_tweets', 'possibly_sensitive',
'no_user_mentions'],
       dtype='object')

Split the tweets data into training, validation, and test sets, following the users data split
[110]:
users_train_set_users_id = train_users_data['id']
users_val_set_users_id = val_users_data['id']
users_test_set_users_id = test_users_data['id']
[111]:
train_tweets_data = tweets_df[tweets_df['author_id'].isin(users_train_set_users_id)]
val_tweets_data = tweets_df[tweets_df['author_id'].isin(users_val_set_users_id)]
test_tweets_data = tweets_df[tweets_df['author_id'].isin(users_test_set_users_id)]
[112]:
train_tweets_data1 = pd.DataFrame(train_tweets_data).copy()
val_tweets_data1 = pd.DataFrame(val_tweets_data).copy()
test_tweets_data1 = pd.DataFrame(test_tweets_data).copy()

Analysis
[113]:
train_tweets_data.columns
[113]:
Index(['id', 'author_id', 'text', 'days_since_prev_tweet', 'created_at',
'is_reply', 'time_of_creation', 'no_urls', 'no_hashtags',
'org_tweet_len', 'no_mentions', 'any_media_attached',
'contains_annotations', 'has_referenced_tweets', 'possibly_sensitive',
'no_user_mentions'],
       dtype='object')
[114]:
columns_to_standardize = [
    # 'id',
    # 'author_id',
    # 'created_at',
    # 'text',
    'days_since_prev_tweet',
    'is_reply',
    'time_of_creation',
    'no_urls',
    'no_hashtags',
    'org_tweet_len',
    'no_mentions',
    'any_media_attached',
    'contains_annotations',
    'has_referenced_tweets',
    'possibly_sensitive',
    'no_user_mentions'
]

Remove unnecessary columns
[115]:
train_tweets_data.columns
[115]:
Index(['id', 'author_id', 'text', 'days_since_prev_tweet', 'created_at',
'is_reply', 'time_of_creation', 'no_urls', 'no_hashtags',
'org_tweet_len', 'no_mentions', 'any_media_attached',
'contains_annotations', 'has_referenced_tweets', 'possibly_sensitive',
'no_user_mentions'],
       dtype='object')
[116]:
unnecessary_col_to_remove = [
    # 'id',
    # 'author_id',   # needed later
    # 'created_at',  # needed to sort tweets per user later
    'org_text',
    'source',
    'conversation_id',
    'user_mentions',
    'hashtags',
    'created_at_datetime'
]
[117]:
train_tweets_data = drop_columns(train_tweets_data, unnecessary_col_to_remove)
val_tweets_data = drop_columns(val_tweets_data, unnecessary_col_to_remove)
test_tweets_data = drop_columns(test_tweets_data, unnecessary_col_to_remove)
[118]:
columns_to_standardize = [col for col in columns_to_standardize if col in train_tweets_data.columns]

Data type conversion (to float)
[119]:
for column_name in columns_to_standardize:
    train_tweets_data[column_name] = train_tweets_data[column_name].astype(float)
    val_tweets_data[column_name] = val_tweets_data[column_name].astype(float)
    test_tweets_data[column_name] = test_tweets_data[column_name].astype(float)

Standardize the other tweet data splits using the training set's column statistics
[120]:
def standardize_column(df, col_name, mean_training, std_training):
    df_cp = df.copy()
    df_cp[col_name] = (df[col_name] - mean_training) / std_training
    return df_cp

Standardize
[121]:
for column_name in columns_to_standardize:
    mean_training = train_tweets_data[column_name].mean()
    std_training = train_tweets_data[column_name].std()
    print(column_name)
    print("mean_training = ", mean_training)
    print("std_training = ", std_training)
    print()
    train_tweets_data = standardize_column(train_tweets_data, column_name, mean_training, std_training)
    val_tweets_data = standardize_column(val_tweets_data, column_name, mean_training, std_training)
    test_tweets_data = standardize_column(test_tweets_data, column_name, mean_training, std_training)

days_since_prev_tweet
mean_training =  10.150619688444115
std_training =  83.86048614893724

is_reply
mean_training =  0.15228149709017305
std_training =  0.35929412990954285

time_of_creation
mean_training =  832.0637669213666
std_training =  379.59396560713145

no_urls
mean_training =  0.5035845212495471
std_training =  0.5677704754324685

no_hashtags
mean_training =  1.8141632627286233
std_training =  3.3132555452757417

org_tweet_len
mean_training =  171.79513887734856
std_training =  76.81292827031992

no_mentions
mean_training =  0.011390036460081694
std_training =  0.19828870716436514

any_media_attached
mean_training =  0.005327758519262024
std_training =  0.07279691697847118

contains_annotations
mean_training =  0.006201869867088545
std_training =  0.07850749748981518

has_referenced_tweets
mean_training =  0.0032837338846106547
std_training =  0.0572098055796236

possibly_sensitive
mean_training =  0.004453647171435504
std_training =  0.06658698772772154

no_user_mentions
mean_training =  0.9017738145488023
std_training =  1.3509284665873897
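As the output above shows, each split is standardized with the training split's mean and standard deviation (never its own), which avoids leaking validation/test statistics into the features; a toy check with made-up lengths:

```python
import pandas as pd

train = pd.DataFrame({"org_tweet_len": [100.0, 200.0, 300.0]})
val = pd.DataFrame({"org_tweet_len": [200.0, 400.0]})

def standardize_column(df, col_name, mean_training, std_training):
    df_cp = df.copy()
    df_cp[col_name] = (df[col_name] - mean_training) / std_training
    return df_cp

m = train["org_tweet_len"].mean()  # 200.0
s = train["org_tweet_len"].std()   # sample std (ddof=1) -> 100.0
val_std = standardize_column(val, "org_tweet_len", m, s)
print(val_std["org_tweet_len"].tolist())  # [0.0, 2.0]
```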
Text preprocessing

Create a backup column for the text before processing
[122]:
train_tweets_data.loc[:, 'text_np'] = train_tweets_data['text']
[123]:
val_tweets_data.loc[:, 'text_np'] = val_tweets_data['text']
[124]:
test_tweets_data.loc[:, 'text_np'] = test_tweets_data['text']

Remove stop words, tokenize the tweet text, and apply word embeddings
[125]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
[126]:
!pip install nltk
Collecting nltk
  Using cached nltk-3.8.1-py3-none-any.whl (1.5 MB)
Requirement already satisfied: click in /opt/conda/lib/python3.7/site-packages (from nltk) (8.1.6)
Requirement already satisfied: joblib in /opt/conda/lib/python3.7/site-packages (from nltk) (1.3.1)
Collecting regex>=2021.8.3 (from nltk)
  Obtaining dependency information for regex>=2021.8.3 from https://files.pythonhosted.org/packages/63/78/ed291d95116695b8b5d7469a931d7c2e83d942df0853915ee504cee98bcf/regex-2023.8.8-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
  Using cached regex-2023.8.8-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (40 kB)
Requirement already satisfied: tqdm in /opt/conda/lib/python3.7/site-packages (from nltk) (4.63.0)
Requirement already satisfied: importlib-metadata in /opt/conda/lib/python3.7/site-packages (from click->nltk) (4.11.4)
Requirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata->click->nltk) (3.15.0)
Requirement already satisfied: typing-extensions>=3.6.4 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata->click->nltk) (4.7.1)
Using cached regex-2023.8.8-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (758 kB)
Installing collected packages: regex, nltk
Successfully installed nltk-3.8.1 regex-2023.8.8
[127]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download('stopwords')
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/jupyter/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[127]:
True
[128]:
nltk.download('punkt')
[nltk_data] Downloading package punkt to /home/jupyter/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[128]:
True
[129]:
train_tweets_data.loc[:, 'text_tk'] = train_tweets_data['text'].apply(lambda text: word_tokenize(text))
val_tweets_data.loc[:, 'text_tk'] = val_tweets_data['text'].apply(lambda text: word_tokenize(text))
test_tweets_data.loc[:, 'text_tk'] = test_tweets_data['text'].apply(lambda text: word_tokenize(text))

Remove stopwords
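The filtering in the next cells keeps only tokens that are not in the stop-word list; a sketch with a toy stop-word set (the notebook itself uses NLTK's English list via `stopwords.words('english')`):

```python
# Toy stop-word set standing in for NLTK's English list
stop_words = {"the", "a", "to", "of"}

tokens = ["the", "warning", "letter", "highlights", "shortfalls", "of", "risk"]
filtered = [word for word in tokens if word not in stop_words]

print(filtered)  # ['warning', 'letter', 'highlights', 'shortfalls', 'risk']
```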
[130]:
stop_words = stopwords.words('english')
[131]:
train_tweets_data.loc[:, 'text_tk'] = train_tweets_data['text_tk'].apply(lambda words: [word for word in words if word not in stop_words])
val_tweets_data.loc[:, 'text_tk'] = val_tweets_data['text_tk'].apply(lambda words: [word for word in words if word not in stop_words])
test_tweets_data.loc[:, 'text_tk'] = test_tweets_data['text_tk'].apply(lambda words: [word for word in words if word not in stop_words])

Rejoin the tokens so the text has no extra spaces
[132]:
train_tweets_data.loc[:, 'text'] = train_tweets_data['text_tk'].apply(lambda words: ' '.join(words))
val_tweets_data.loc[:, 'text'] = val_tweets_data['text_tk'].apply(lambda words: ' '.join(words))
test_tweets_data.loc[:, 'text'] = test_tweets_data['text_tk'].apply(lambda words: ' '.join(words))

Word embedding
[133]:
# !wget http://nlp.stanford.edu/data/glove.6B.zip
# !unzip glove.6B.zip

For the training set
[134]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_tweets_data['text'])
word_index = tokenizer.word_index
num_words = len(word_index) + 1  # adding 1 for the padding token
embedding_dim = 100  # using GloVe 100-dimensional vectors

Integer-encode the text
[135]:
train_tweets_data.loc[:, 'text_seq'] = train_tweets_data['text'].apply(lambda text: tokenizer.texts_to_sequences([text])[0])
val_tweets_data.loc[:, 'text_seq'] = val_tweets_data['text'].apply(lambda text: tokenizer.texts_to_sequences([text])[0])
test_tweets_data.loc[:, 'text_seq'] = test_tweets_data['text'].apply(lambda text: tokenizer.texts_to_sequences([text])[0])

Pad the encoded text to a maximum length
[136]:
max_length_train = train_tweets_data['text_seq'].apply(len).max()
max_length_val = val_tweets_data['text_seq'].apply(len).max()
max_length_test = test_tweets_data['text_seq'].apply(len).max()
[137]:
max_length_train
[137]:
76
[138]:
max_length_val
[138]:
37
[139]:
max_length_test
[139]:
36
[140]:
train_tweets_data['text_seq'].apply(len).mean()
[140]:
14.39626491888712
[141]:
val_tweets_data['text_seq'].apply(len).mean()
[141]:
12.773751671504517
[142]:
test_tweets_data['text_seq'].apply(len).mean()
[142]:
12.53067008570079
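Since `max_length` is capped at 15 below (well under the training maximum of 76 but just above the mean of ~14), sequences are both truncated and padded. With `padding='post'` and Keras's default `truncating='pre'`, long sequences keep their last `maxlen` tokens and short ones are zero-filled at the end; a pure-Python sketch of that behavior:

```python
def pad_post(seq, maxlen):
    # Mimics pad_sequences(..., padding='post') with the default truncating='pre':
    # keep the LAST maxlen tokens, then zero-pad at the end
    seq = seq[-maxlen:]
    return seq + [0] * (maxlen - len(seq))

print(pad_post([5, 9, 3], 5))           # [5, 9, 3, 0, 0]
print(pad_post([1, 2, 3, 4, 5, 6], 5))  # [2, 3, 4, 5, 6]
```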
[143]:
max_length = 15  # max_length_train
[144]:
train_tweets_data.loc[:, 'text_seq_ps'] = train_tweets_data['text_seq'].apply(lambda encoded: pad_sequences([encoded], maxlen=max_length, padding='post'))
val_tweets_data.loc[:, 'text_seq_ps'] = val_tweets_data['text_seq'].apply(lambda encoded: pad_sequences([encoded], maxlen=max_length, padding='post'))
test_tweets_data.loc[:, 'text_seq_ps'] = test_tweets_data['text_seq'].apply(lambda encoded: pad_sequences([encoded], maxlen=max_length, padding='post'))

Load the GloVe embeddings
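Each line of `glove.6B.100d.txt` is a token followed by 100 floats; the loading loop below splits on whitespace, keeps the first field as the word, and parses the rest as a float32 vector. A sketch with a short made-up line (real entries have 100 dimensions):

```python
import numpy as np

# A made-up 4-dimensional line standing in for a real 100-d GloVe entry
line = "tweet 0.1 -0.2 0.3 0.4\n"

values = line.split()
word = values[0]
coefs = np.asarray(values[1:], dtype="float32")

print(word, coefs.shape)
```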
[145]:
embeddings_index = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
[146]:
print('Loaded %s word vectors.' % len(embeddings_index))
Loaded 400000 word vectors.
Creating a weight matrix for the words
[147]:
embedding_matrix = np.zeros((num_words, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

Correlation of numeric tweets data
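The heatmap cell below masks out the upper triangle and the diagonal of the correlation matrix with `np.tril(..., k=-1)`, so each feature pair is counted once; on a 3×3 matrix the boolean mask looks like this:

```python
import numpy as np

# True strictly below the diagonal; the diagonal and upper triangle are masked out
mask = np.tril(np.ones((3, 3)), k=-1).astype(bool)
print(mask)
```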
[148]:
sns.set(font_scale=1.5)
[149]:
corr_threshold = 0.52
corr = train_tweets_data[columns_to_standardize].corr()
# Keep only the lower triangle of the correlation matrix
lower_tri = corr.where(np.tril(np.ones(corr.shape), k=-1).astype(bool))
f = plt.figure(figsize=(20, 15))
sns.heatmap(lower_tri, cmap="PiYG", annot=True, vmin=-1, vmax=1, ax=plt.gca())  # , annot_kws={"fontsize": 16})
high_corr = []
for column in train_tweets_data[columns_to_standardize]:
    for col in train_tweets_data[columns_to_standardize]:
        if abs(lower_tri[column][col]) > corr_threshold:
            high_corr.append((column, col, lower_tri[column][col]))
high_corr = sorted(high_corr, key=lambda x: x[2], reverse=True)
[150]:
sns.set(font_scale=1)
[151]:
print("Number of columns containing high correlation:", len(set([x[0] for x in high_corr])))
high_corr
Number of columns containing high correlation: 0
[151]:
[]
[152]:
# train_tweets_data = train_tweets_data.drop(['cleaned_tweet_len', 'quote_count'], axis=1)
[153]:
# val_tweets_data = val_tweets_data.drop(['cleaned_tweet_len', 'quote_count'], axis=1)
[154]:
# test_tweets_data = test_tweets_data.drop(['cleaned_tweet_len', 'quote_count'], axis=1)

Split the tweets data into inputs and outputs

And convert the inputs to tensors
[155]:
train_tweets_data.columns
[155]:
Index(['id', 'author_id', 'text', 'days_since_prev_tweet', 'created_at',
'is_reply', 'time_of_creation', 'no_urls', 'no_hashtags',
'org_tweet_len', 'no_mentions', 'any_media_attached',
'contains_annotations', 'has_referenced_tweets', 'possibly_sensitive',
'no_user_mentions', 'text_np', 'text_tk', 'text_seq', 'text_seq_ps'],
       dtype='object')
[156]:
feature_cols = ['days_since_prev_tweet', 'is_reply', 'time_of_creation', 'no_urls',
                'no_hashtags', 'org_tweet_len', 'no_mentions', 'any_media_attached',
                'contains_annotations', 'has_referenced_tweets', 'possibly_sensitive',
                'no_user_mentions']
compact_train_tweets_text_data = []
compact_train_tweets_add_feat_data = []
for author_id, group in train_tweets_data.groupby('author_id'):
    group = group.sort_values('created_at', ascending=False)
    author_tweets_text = []
    author_tweets_add_feat = []
    for index, row in group.iterrows():
        author_tweets_add_feat.append([row[c] for c in feature_cols])
        author_tweets_text.append(row['text_seq_ps'][0])
    compact_train_tweets_text_data.append(author_tweets_text)
    compact_train_tweets_add_feat_data.append(author_tweets_add_feat)

compact_train_tweets_text_data = np.array(compact_train_tweets_text_data)
compact_train_tweets_add_feat_data = np.array(compact_train_tweets_add_feat_data)
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:26: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:27: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
[157]:
feature_cols = ['days_since_prev_tweet', 'is_reply', 'time_of_creation', 'no_urls',
                'no_hashtags', 'org_tweet_len', 'no_mentions', 'any_media_attached',
                'contains_annotations', 'has_referenced_tweets', 'possibly_sensitive',
                'no_user_mentions']
compact_val_tweets_text_data = []
compact_val_tweets_add_feat_data = []
for author_id, group in val_tweets_data.groupby('author_id'):
    group = group.sort_values('created_at', ascending=False)
    author_tweets_text = []
    author_tweets_add_feat = []
    for index, row in group.iterrows():
        author_tweets_add_feat.append([row[c] for c in feature_cols])
        author_tweets_text.append(row['text_seq_ps'][0])
    compact_val_tweets_text_data.append(author_tweets_text)
    compact_val_tweets_add_feat_data.append(author_tweets_add_feat)

compact_val_tweets_text_data = np.array(compact_val_tweets_text_data)
compact_val_tweets_add_feat_data = np.array(compact_val_tweets_add_feat_data)
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:26: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:27: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
[158]:
compact_test_tweets_text_data = []
compact_test_tweets_add_feat_data = []
for author_id, group in test_tweets_data.groupby('author_id'):
    group = group.sort_values('created_at', ascending=False)
    author_tweets_text = []
    author_tweets_add_feat = []
    for index, row in group.iterrows():
        row_arr = []
        row_arr.append(row['days_since_prev_tweet'])
        row_arr.append(row['is_reply'])
        row_arr.append(row['time_of_creation'])
        row_arr.append(row['no_urls'])
        row_arr.append(row['no_hashtags'])
        row_arr.append(row['org_tweet_len'])
        row_arr.append(row['no_mentions'])
        row_arr.append(row['any_media_attached'])
        row_arr.append(row['contains_annotations'])
        row_arr.append(row['has_referenced_tweets'])
        row_arr.append(row['possibly_sensitive'])
        row_arr.append(row['no_user_mentions'])
        author_tweets_add_feat.append(row_arr)
        author_tweets_text.append(row['text_seq_ps'][0])
    compact_test_tweets_text_data.append(author_tweets_text)
    compact_test_tweets_add_feat_data.append(author_tweets_add_feat)

# dtype=object for ragged per-author sequences (see the train cell above)
compact_test_tweets_text_data = np.array(compact_test_tweets_text_data, dtype=object)
compact_test_tweets_add_feat_data = np.array(compact_test_tweets_add_feat_data, dtype=object)
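The three cells above repeat the same grouping logic for train, validation, and test. A hedged sketch of a shared helper that could replace them (the `feat_cols` list, column names, and the toy frame below are illustrative only, not the project's actual data):

```python
import numpy as np
import pandas as pd

def build_author_sequences(df, feat_cols, text_col='text_seq_ps', sort_col='created_at'):
    """Group rows by author and collect per-author feature sequences, newest first."""
    texts, feats = [], []
    for _, group in df.groupby('author_id'):
        group = group.sort_values(sort_col, ascending=False)
        feats.append(group[feat_cols].to_numpy(dtype='float32'))
        texts.append([t[0] for t in group[text_col]])
    # dtype=object because authors have different numbers of tweets (ragged)
    return np.array(texts, dtype=object), np.array(feats, dtype=object)

# Toy data: two authors with unequal tweet counts
toy = pd.DataFrame({
    'author_id': [1, 1, 2],
    'created_at': [2, 1, 3],
    'no_urls': [0.0, 1.0, 2.0],
    'text_seq_ps': [[[9, 9]], [[8, 8]], [[7, 7]]],
})
texts, feats = build_author_sequences(toy, ['no_urls'])
print(feats[0].shape, feats[1].shape)  # (2, 1) (1, 1)
```

Calling it once per split (`build_author_sequences(train_tweets_data, …)`, etc., assuming the real feature columns are passed in) would keep the feature list in one place.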
[159]:
# train_tweets_add_feat_data_X = tf.convert_to_tensor(train_tweets_data.drop(['id', 'author_id', 'text', 'text_np', 'text_tk', 'text_seq', 'text_seq_ps'], axis=1).values, dtype=tf.float32)
# train_tweets_text_data_X = train_tweets_data['text_seq_ps'].apply(lambda x: x[0])
# train_tweets_text_data_X_tensor = tf.convert_to_tensor(train_tweets_text_data_X.tolist(), dtype=tf.float32)
# val_tweets_add_feat_data_X = tf.convert_to_tensor(val_tweets_data.drop(['id', 'author_id', 'text', 'text_np', 'text_tk', 'text_seq', 'text_seq_ps'], axis=1).values, dtype=tf.float32)
# val_tweets_text_data_X = val_tweets_data['text_seq_ps'].apply(lambda x: x[0])
# val_tweets_text_data_X_tensor = tf.convert_to_tensor(val_tweets_text_data_X.tolist(), dtype=tf.float32)
# test_tweets_add_feat_data_X = tf.convert_to_tensor(test_tweets_data.drop(['id', 'author_id', 'text', 'text_np', 'text_tk', 'text_seq', 'text_seq_ps'], axis=1).values, dtype=tf.float32)
# test_tweets_text_data_X = test_tweets_data['text_seq_ps'].apply(lambda x: x[0])
# test_tweets_text_data_X_tensor = tf.convert_to_tensor(test_tweets_text_data_X.tolist(), dtype=tf.float32)
Reformat tweets data¶
[160]:
compact_train_tweets_text_data.shape
[160]:
(6946,)
[161]:
np.array(compact_train_tweets_text_data[0]).shape
[161]:
(66, 15)
[162]:
max_l = 0
for arr_user_tweets_feat in compact_train_tweets_text_data:
    curr = np.array(arr_user_tweets_feat).shape[0]
    if curr > max_l:
        max_l = curr
max_l
[162]:
3397
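Since every element of the object array is itself a sequence, the scan above can be collapsed to a built-in `max`. A small self-contained sketch (the shapes below are made up to mirror the data):

```python
import numpy as np

# Ragged per-author sequences stored as an object array,
# mirroring compact_train_tweets_text_data
seqs = np.array(
    [np.zeros((66, 15)), np.zeros((3397, 15)), np.zeros((5, 15))],
    dtype=object,
)

max_l = max(len(s) for s in seqs)
print(max_l)  # 3397
```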
[163]:
max_user_tweets_num = 20  # max_l
[164]:
compact_train_tweets_text_data_padded = pad_sequences(compact_train_tweets_text_data, maxlen=max_user_tweets_num, padding='post', truncating='post', dtype='float32')
compact_val_tweets_text_data_padded = pad_sequences(compact_val_tweets_text_data, maxlen=max_user_tweets_num, padding='post', truncating='post', dtype='float32')
compact_test_tweets_text_data_padded = pad_sequences(compact_test_tweets_text_data, maxlen=max_user_tweets_num, padding='post', truncating='post', dtype='float32')
len(compact_train_tweets_text_data_padded)
[164]:
6946
[165]:
compact_train_tweets_add_feat_data_padded = pad_sequences(compact_train_tweets_add_feat_data, maxlen=max_user_tweets_num, padding='post', truncating='post', dtype='float32')
compact_val_tweets_add_feat_data_padded = pad_sequences(compact_val_tweets_add_feat_data, maxlen=max_user_tweets_num, padding='post', truncating='post', dtype='float32')
compact_test_tweets_add_feat_data_padded = pad_sequences(compact_test_tweets_add_feat_data, maxlen=max_user_tweets_num, padding='post', truncating='post', dtype='float32')
len(compact_train_tweets_add_feat_data_padded)
[165]:
6946
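`pad_sequences` with `padding='post', truncating='post'` keeps the first `maxlen` entries of each sequence and zero-fills the rest. A numpy-only sketch of that behavior (not Keras's actual implementation):

```python
import numpy as np

def pad_post(seqs, maxlen, n_feat):
    """Post-pad/post-truncate ragged (len_i, n_feat) sequences to (len(seqs), maxlen, n_feat)."""
    out = np.zeros((len(seqs), maxlen, n_feat), dtype='float32')
    for i, s in enumerate(seqs):
        s = np.asarray(s, dtype='float32')[:maxlen]  # truncating='post': keep the head
        out[i, :len(s)] = s                          # padding='post': zeros at the tail
    return out

seqs = [np.ones((3, 2)), np.ones((7, 2))]
padded = pad_post(seqs, maxlen=5, n_feat=2)
print(padded.shape)  # (2, 5, 2)
```

Because each author's rows were sorted newest first, post-truncation keeps each author's 20 most recent tweets, which matches the `20_latest_tweets_of_user` naming used for the models below.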
[166]:
train_users_data_Y.shape
[166]:
(6946, 1)
[167]:
np.array(train_users_data_X).shape
[167]:
(6946, 6)
[168]:
np.array(compact_train_tweets_text_data_padded[0]).shape
[168]:
(20, 15)
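The all-zero rows introduced by padding are what the `Masking(mask_value=0.0)` layer in the models below skips: a timestep is masked when every feature equals the mask value. A numpy sketch of the mask it computes:

```python
import numpy as np

x = np.array([[[1.0, 2.0],
               [0.0, 3.0],    # not masked: only some features are zero
               [0.0, 0.0]]])  # masked: all features equal mask_value

mask = np.any(x != 0.0, axis=-1)  # True where the timestep carries real data
print(mask)  # [[ True  True False]]
```

One caveat: a real tweet whose 12 features all happen to be exactly 0.0 would be masked too; whether that occurs here depends on the standardization applied upstream.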
DNN models¶
Function to load a saved neural network model¶
[169]:
from keras.models import load_model

def load_model_from_file(filepath):
    model = load_model(filepath)
    return model
Summary of metrics based on real and predicted data by the network¶
[170]:
def get_model_metrics(test_Y, out_Y):
    accuracy = accuracy_score(test_Y, out_Y)
    print('Accuracy: {}'.format(accuracy))
    # precision: tp / (tp + fp), reported per class
    precision = precision_score(test_Y, out_Y, average=None)
    print('Precision: {}'.format(precision))
    # recall: tp / (tp + fn)
    recall = recall_score(test_Y, out_Y)
    print('Recall: {}'.format(recall))
    # f1: 2 tp / (2 tp + fp + fn)
    f1 = f1_score(test_Y, out_Y)
    print('F1 score: %f' % f1)
    # ROC AUC
    auc = roc_auc_score(test_Y, out_Y)
    print('ROC AUC: %f' % auc)
    return (accuracy, precision, recall, f1, auc)
Creating a confusion matrix¶
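All of the printed metrics reduce to the four confusion-matrix counts. A self-contained numpy check of the formulas noted in the comments of `get_model_metrics` (tp / (tp + fp), tp / (tp + fn), 2 tp / (2 tp + fp + fn)), using made-up labels and the same 0.5 threshold as `prediction_and_metrics`:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0])
y_prob = np.array([0.9, 0.2, 0.4, 0.8, 0.6, 0.1])
y_pred = (y_prob >= 0.5).astype(int)  # threshold sigmoid outputs at 0.5

tp = int(np.sum((y_true == 1) & (y_pred == 1)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))

accuracy  = (tp + tn) / len(y_true)
precision = tp / (tp + fp)            # positive class only
recall    = tp / (tp + fn)
f1        = 2 * tp / (2 * tp + fp + fn)
print(accuracy, precision, recall, f1)
```

Note that `precision_score(..., average=None)` above returns one value per class, while `recall_score` and `f1_score` with default arguments report only the positive class; this sketch follows the positive-class convention.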
[171]:
def create_confusion_matrix(test_Y, out_Y):
    cm = sklearn.metrics.confusion_matrix(test_Y, out_Y)
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    labels = [f"{v1}\n\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
    labels = np.asarray(labels).reshape(2, 2)
    # create a single figure (a second bare plt.figure() call here would leave
    # an empty stray figure behind each plot)
    fig = plt.figure(figsize=(4, 4))
    ax = fig.add_subplot(111)
    sns.heatmap(
        cm,
        annot=labels,
        annot_kws={"size": 12},
        fmt='',
        cmap=plt.cm.Blues,
        cbar=False
    )
    ax.set_title("Confusion matrix", fontsize=14)
    ax.set_xticklabels(ax.get_xticklabels(), fontsize=12)
    ax.set_yticklabels(ax.get_yticklabels(), fontsize=12)
    ax.set_ylabel('True', fontsize=12)
    ax.set_xlabel('Predicted', fontsize=12)
    fig.show()
Neural network models¶
[172]:
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Input, Concatenate, concatenate, Masking
[173]:
# EarlyStopping
def early_stop(metric='val_accuracy', mode='max', patience=50):
    return EarlyStopping(monitor=metric, patience=patience, restore_best_weights=True, mode=mode)

# PlotLosses
def plot_losses():
    return PlotLossesCallback()

# ModelCheckpoint
def checkpoint_callback(model_name):
    return ModelCheckpoint(filepath=models_path + '/' + model_name + '.hdf5',
                           monitor="val_accuracy",
                           save_best_only=True,
                           # save_weights_only=True,
                           verbose=1)
[174]:
def train_model(model, model_name, train_X, train_Y, val_X, val_Y, batch_size, epochs, patience=50):
    model.fit(train_X, train_Y,
              batch_size=batch_size,
              epochs=epochs,
              validation_data=(val_X, val_Y),
              callbacks=[plot_losses(),
                         early_stop(metric='val_accuracy', mode='max', patience=patience),
                         checkpoint_callback(model_name)])
    return model
[175]:
def prediction_and_metrics(model, test_X, test_Y):
    out_Y_org = model.predict(test_X, verbose=0)
    out_Y = [0 if x < 0.5 else 1 for x in out_Y_org]
    x = range(0, len(test_Y))
    fig = plt.figure(figsize=(18, 4))
    colors = ['blue' if val == 0. else 'red' for val in np.asarray(test_Y)]
    plt.scatter(x, out_Y_org, marker='.', label='predicted', c=colors)
    plt.plot(x, [0.5] * len(test_Y), c='orange')
    plt.ylim((0, 1))
    create_confusion_matrix(test_Y, out_Y)
    get_model_metrics(test_Y, out_Y)
Model 1. (only additional features of tweets)¶
Create model¶
[ ]:
def create_model_1(add_tweets_feat_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
    # Additional tweet's features input
    additional_tweet_input = Input(shape=add_tweets_feat_shape)
    masked_input = Masking(mask_value=0.0)(additional_tweet_input)
    lstm1 = LSTM(64, return_sequences=True)(masked_input)
    lstm2 = LSTM(64)(lstm1)
    dropout = Dropout(0.5)(lstm2)
    output_layer = Dense(1, activation='sigmoid')(dropout)
    model = keras.Model(inputs=additional_tweet_input, outputs=output_layer)
    model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
    # dense_layer1 = Dense(128)(concatenated)
    # activation_layer1 = Activation('relu')(dense_layer1)
    # output_layer = Dense(1, activation='sigmoid')(activation_layer1)
    # model = Model(inputs=[text_input, additional_tweet_input], outputs=output_layer)
    # model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
    return model
batch_size=50, epochs=400¶
Create and train model¶
[183]:
model_name = 'model_tweets_data_based_10000_1_v1_batch_size_50_20_latest_tweets_of_user_padded_add_feat_only'
model = create_model_1(add_tweets_feat_shape=np.array(compact_train_tweets_add_feat_data_padded[0]).shape,
                       optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
[184]:
model = train_model(model, model_name,
                    train_X=compact_train_tweets_add_feat_data_padded, train_Y=train_users_data_Y,
                    val_X=compact_val_tweets_add_feat_data_padded, val_Y=val_users_data_Y,
                    batch_size=50, epochs=400, patience=100)
accuracy
    training   (min: 0.499, max: 0.901, cur: 0.896)
    validation (min: 0.465, max: 0.509, cur: 0.476)
Loss
    training   (min: 0.195, max: 0.696, cur: 0.199)
    validation (min: 0.694, max: 2.743, cur: 2.587)
Epoch 101: val_accuracy did not improve from 0.50942
139/139 [==============================] - 6s 43ms/step - loss: 0.1986 - accuracy: 0.8963 - val_loss: 2.5874 - val_accuracy: 0.4764
Prediction and results¶
[185]:
prediction_and_metrics(model, compact_test_tweets_add_feat_data_padded, test_users_data_Y)
Accuracy: 0.48013698630136986
Precision: [0.47966339 0.48058902]
Recall: 0.4917808219178082
F1 score: 0.486121
ROC AUC: 0.480137
batch_size=100, epochs=400¶
Create and train model¶
[186]:
model_name = 'model_tweets_data_based_10000_1_v1_batch_size_100_20_latest_tweets_of_user_padded_add_feat_only'
model = create_model_1(add_tweets_feat_shape=np.array(compact_train_tweets_add_feat_data_padded[0]).shape,
                       optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
[187]:
model = train_model(model, model_name,
                    train_X=compact_train_tweets_add_feat_data_padded, train_Y=train_users_data_Y,
                    val_X=compact_val_tweets_add_feat_data_padded, val_Y=val_users_data_Y,
                    batch_size=100, epochs=400, patience=100)
accuracy
    training   (min: 0.504, max: 0.896, cur: 0.896)
    validation (min: 0.472, max: 0.513, cur: 0.493)
Loss
    training   (min: 0.197, max: 0.696, cur: 0.197)
    validation (min: 0.693, max: 2.659, cur: 2.636)
Epoch 117: val_accuracy did not improve from 0.51279
70/70 [==============================] - 4s 61ms/step - loss: 0.1965 - accuracy: 0.8958 - val_loss: 2.6361 - val_accuracy: 0.4933
Prediction and results¶
[188]:
prediction_and_metrics(model, compact_test_tweets_add_feat_data_padded, test_users_data_Y)
Accuracy: 0.49383561643835616
Precision: [0.49454545 0.49291339]
Recall: 0.42876712328767125
F1 score: 0.458608
ROC AUC: 0.493836
batch_size=250, epochs=400¶
Create and train model¶
[189]:
model_name = 'model_tweets_data_based_10000_1_v1_batch_size_250_20_latest_tweets_of_user_padded_add_feat_only'
model = create_model_1(add_tweets_feat_shape=np.array(compact_train_tweets_add_feat_data_padded[0]).shape,
                       optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
[190]:
model = train_model(model, model_name,
                    train_X=compact_train_tweets_add_feat_data_padded, train_Y=train_users_data_Y,
                    val_X=compact_val_tweets_add_feat_data_padded, val_Y=val_users_data_Y,
                    batch_size=250, epochs=400, patience=100)
accuracy
    training   (min: 0.498, max: 0.815, cur: 0.812)
    validation (min: 0.480, max: 0.516, cur: 0.504)
Loss
    training   (min: 0.336, max: 0.695, cur: 0.336)
    validation (min: 0.693, max: 1.879, cur: 1.879)
Epoch 103: val_accuracy did not improve from 0.51615
28/28 [==============================] - 3s 108ms/step - loss: 0.3357 - accuracy: 0.8115 - val_loss: 1.8786 - val_accuracy: 0.5040
Prediction and results¶
[191]:
prediction_and_metrics(model, compact_test_tweets_add_feat_data_padded, test_users_data_Y)
Accuracy: 0.476027397260274
Precision: [0.46534653 0.48167539]
Recall: 0.6301369863013698
F1 score: 0.545994
ROC AUC: 0.476027
Prediction on training subset¶
[192]:
prediction_and_metrics(model, compact_train_tweets_add_feat_data_padded, train_users_data_Y)
Accuracy: 0.5185718399078606
Precision: [0.52686381 0.51419142]
Recall: 0.6729052692196947
F1 score: 0.582938
ROC AUC: 0.518572
batch_size=500, epochs=400¶
Create and train model¶
[193]:
model_name = 'model_tweets_data_based_10000_1_v1_batch_size_500_20_latest_tweets_of_user_padded_add_feat_only'
model = create_model_1(add_tweets_feat_shape=np.array(compact_train_tweets_add_feat_data_padded[0]).shape,
                       optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
[194]:
model = train_model(model, model_name,
                    train_X=compact_train_tweets_add_feat_data_padded, train_Y=train_users_data_Y,
                    val_X=compact_val_tweets_add_feat_data_padded, val_Y=val_users_data_Y,
                    batch_size=500, epochs=400, patience=100)
accuracy
    training   (min: 0.499, max: 0.930, cur: 0.920)
    validation (min: 0.485, max: 0.517, cur: 0.515)
Loss
    training   (min: 0.119, max: 0.695, cur: 0.148)
    validation (min: 0.694, max: 4.214, cur: 2.817)
Epoch 339: val_accuracy did not improve from 0.51682
14/14 [==============================] - 3s 191ms/step - loss: 0.1481 - accuracy: 0.9198 - val_loss: 2.8168 - val_accuracy: 0.5155
Prediction and results¶
[195]:
prediction_and_metrics(model, compact_test_tweets_add_feat_data_padded, test_users_data_Y)
Accuracy: 0.5
Precision: [0.5 0.5]
Recall: 0.5178082191780822
F1 score: 0.508748
ROC AUC: 0.500000
Model 0. (only additional features of tweets)¶
Create model¶
[423]:
def create_model_0(add_tweets_feat_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
    # Additional tweet's features input
    additional_tweet_input = Input(shape=add_tweets_feat_shape)
    masked_input = Masking(mask_value=0.0)(additional_tweet_input)
    # lstm1 = LSTM(128, return_sequences=True)(masked_input)
    # lstm1_dropout1 = Dropout(0.2)(lstm1)
    # lstm2 = LSTM(64)(lstm1_dropout1)
    flatten_layer1 = Flatten()(masked_input)  # note: Flatten does not propagate the mask
    dense_layer1 = Dense(64, activation='relu')(flatten_layer1)
    dropout = Dropout(0.2)(dense_layer1)
    output_layer = Dense(1, activation='sigmoid')(dropout)
    model = keras.Model(inputs=additional_tweet_input, outputs=output_layer)
    model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
    # dense_layer1 = Dense(128)(concatenated)
    # activation_layer1 = Activation('relu')(dense_layer1)
    # output_layer = Dense(1, activation='sigmoid')(activation_layer1)
    # model = Model(inputs=[text_input, additional_tweet_input], outputs=output_layer)
    # model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
    return model
batch_size=250, epochs=400¶
Create and train model¶
[424]:
model_name = 'model_tweets_data_based_10000_0_v1_batch_size_250_20_latest_tweets_of_user_padded_add_feat_only'
model = create_model_0(add_tweets_feat_shape=np.array(compact_train_tweets_add_feat_data_padded[0]).shape,
                       optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
[425]:
model = train_model(model, model_name,
                    train_X=compact_train_tweets_add_feat_data_padded, train_Y=train_users_data_Y,
                    val_X=compact_val_tweets_add_feat_data_padded, val_Y=val_users_data_Y,
                    batch_size=250, epochs=400, patience=100)
accuracy
    training   (min: 0.490, max: 0.795, cur: 0.795)
    validation (min: 0.483, max: 0.535, cur: 0.507)
Loss
    training   (min: 0.421, max: 0.768, cur: 0.426)
    validation (min: 0.714, max: 1.137, cur: 0.996)
Epoch 102: val_accuracy did not improve from 0.53499
28/28 [==============================] - 1s 35ms/step - loss: 0.4258 - accuracy: 0.7946 - val_loss: 0.9959 - val_accuracy: 0.5067
Prediction and results¶
[426]:
prediction_and_metrics(model, compact_test_tweets_add_feat_data_padded, test_users_data_Y)
Accuracy: 0.5198630136986301
Precision: [0.52056738 0.5192053 ]
Recall: 0.536986301369863
F1 score: 0.527946
ROC AUC: 0.519863
Prediction and results of training set¶
[427]:
prediction_and_metrics(model, compact_train_tweets_add_feat_data_padded, train_users_data_Y)
Accuracy: 0.5407428735963145
Precision: [0.54253081 0.5390992 ]
Recall: 0.5617621652749784
F1 score: 0.550197
ROC AUC: 0.540743
Model 2. (only additional features of tweets)¶
Create model¶
[196]:
def create_model_2(add_tweets_feat_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
    # Additional tweet's features input
    additional_tweet_input = Input(shape=add_tweets_feat_shape)
    masked_input = Masking(mask_value=0.0)(additional_tweet_input)
    lstm1 = LSTM(128, return_sequences=True)(masked_input)
    lstm1_dropout1 = Dropout(0.2)(lstm1)
    lstm2 = LSTM(64)(lstm1_dropout1)
    dropout = Dropout(0.2)(lstm2)
    output_layer = Dense(1, activation='sigmoid')(dropout)
    model = keras.Model(inputs=additional_tweet_input, outputs=output_layer)
    model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
    # dense_layer1 = Dense(128)(concatenated)
    # activation_layer1 = Activation('relu')(dense_layer1)
    # output_layer = Dense(1, activation='sigmoid')(activation_layer1)
    # model = Model(inputs=[text_input, additional_tweet_input], outputs=output_layer)
    # model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
    return model
batch_size=250, epochs=400¶
Create and train model¶
[197]:
model_name = 'model_tweets_data_based_10000_2_v1_batch_size_250_20_latest_tweets_of_user_padded_add_feat_only'
model = create_model_2(add_tweets_feat_shape=np.array(compact_train_tweets_add_feat_data_padded[0]).shape,
                       optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
[198]:
model = train_model(model, model_name,
                    train_X=compact_train_tweets_add_feat_data_padded, train_Y=train_users_data_Y,
                    val_X=compact_val_tweets_add_feat_data_padded, val_Y=val_users_data_Y,
                    batch_size=250, epochs=400, patience=100)
accuracy
    training   (min: 0.500, max: 0.918, cur: 0.915)
    validation (min: 0.464, max: 0.513, cur: 0.489)
Loss
    training   (min: 0.148, max: 0.695, cur: 0.148)
    validation (min: 0.693, max: 2.818, cur: 2.770)
Epoch 122: val_accuracy did not improve from 0.51279
28/28 [==============================] - 4s 147ms/step - loss: 0.1482 - accuracy: 0.9148 - val_loss: 2.7697 - val_accuracy: 0.4886
Prediction and results¶
[199]:
prediction_and_metrics(model, compact_test_tweets_add_feat_data_padded, test_users_data_Y)
Accuracy: 0.5020547945205479
Precision: [0.50189633 0.50224215]
Recall: 0.4602739726027397
F1 score: 0.480343
ROC AUC: 0.502055
Model 3. (only additional features of tweets)¶
Create model¶
[200]:
def create_model_3(add_tweets_feat_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
    # Additional tweet's features input
    additional_tweet_input = Input(shape=add_tweets_feat_shape)
    masked_input = Masking(mask_value=0.0)(additional_tweet_input)
    lstm1 = LSTM(128, return_sequences=True)(masked_input)
    lstm1_dropout1 = Dropout(0.2)(lstm1)
    lstm2 = LSTM(64)(lstm1_dropout1)
    lstm2_dropout = Dropout(0.2)(lstm2)
    dense_layer1 = Dense(64)(lstm2_dropout)
    dense_layer1_activation_layer1 = Activation('relu')(dense_layer1)
    dense_layer1_dropout1 = Dropout(0.1)(dense_layer1_activation_layer1)
    output_layer = Dense(1, activation='sigmoid')(dense_layer1_dropout1)
    model = keras.Model(inputs=additional_tweet_input, outputs=output_layer)
    model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
    # dense_layer1 = Dense(128)(concatenated)
    # activation_layer1 = Activation('relu')(dense_layer1)
    # output_layer = Dense(1, activation='sigmoid')(activation_layer1)
    # model = Model(inputs=[text_input, additional_tweet_input], outputs=output_layer)
    # model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
    return model
batch_size=250, epochs=400¶
Create and train model¶
[201]:
model_name = 'model_tweets_data_based_10000_3_v1_batch_size_250_20_latest_tweets_of_user_padded_add_feat_only'
model = create_model_3(add_tweets_feat_shape=np.array(compact_train_tweets_add_feat_data_padded[0]).shape,
                       optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
[202]:
model = train_model(model, model_name,
                    train_X=compact_train_tweets_add_feat_data_padded, train_Y=train_users_data_Y,
                    val_X=compact_val_tweets_add_feat_data_padded, val_Y=val_users_data_Y,
                    batch_size=250, epochs=400, patience=100)
accuracy
    training   (min: 0.500, max: 0.909, cur: 0.908)
    validation (min: 0.481, max: 0.515, cur: 0.491)
Loss
    training   (min: 0.173, max: 0.694, cur: 0.173)
    validation (min: 0.692, max: 2.711, cur: 2.674)
Epoch 109: val_accuracy did not improve from 0.51548
28/28 [==============================] - 4s 143ms/step - loss: 0.1728 - accuracy: 0.9083 - val_loss: 2.6736 - val_accuracy: 0.4906
Prediction and results¶
[203]:
prediction_and_metrics(model, compact_test_tweets_add_feat_data_padded, test_users_data_Y)
Accuracy: 0.49931506849315066
Precision: [0.49941107 0.49918167]
Recall: 0.4178082191780822
F1 score: 0.454884
ROC AUC: 0.499315
Model 4. (only additional features of tweets)¶
Create model¶
[204]:
def create_model_4(add_tweets_feat_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
    # Additional tweet's features input
    additional_tweet_input = Input(shape=add_tweets_feat_shape)
    masked_input = Masking(mask_value=0.0)(additional_tweet_input)
    lstm1 = LSTM(64, return_sequences=False)(masked_input)
    lstm1_dropout1 = Dropout(0.2)(lstm1)
    # lstm2 = LSTM(64)(lstm1_dropout1)
    # lstm2_dropout = Dropout(0.2)(lstm2)
    dense_layer1 = Dense(64)(lstm1_dropout1)
    dense_layer1_activation_layer1 = Activation('relu')(dense_layer1)
    dense_layer1_dropout1 = Dropout(0.1)(dense_layer1_activation_layer1)
    output_layer = Dense(1, activation='sigmoid')(dense_layer1_dropout1)
    model = keras.Model(inputs=additional_tweet_input, outputs=output_layer)
    model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
    return model
batch_size=250, epochs=400¶
Create and train model¶
[205]:
model_name = 'model_tweets_data_based_10000_4_v1_batch_size_250_20_latest_tweets_of_user_padded_add_feat_only'
model = create_model_4(add_tweets_feat_shape=np.array(compact_train_tweets_add_feat_data_padded[0]).shape,
                       optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
[206]:
model = train_model(model, model_name,
                    train_X=compact_train_tweets_add_feat_data_padded, train_Y=train_users_data_Y,
                    val_X=compact_val_tweets_add_feat_data_padded, val_Y=val_users_data_Y,
                    batch_size=250, epochs=400, patience=100)
accuracy
    training   (min: 0.500, max: 0.875, cur: 0.868)
    validation (min: 0.482, max: 0.521, cur: 0.502)
Loss
    training   (min: 0.239, max: 0.696, cur: 0.258)
    validation (min: 0.694, max: 2.371, cur: 2.371)
Epoch 153: val_accuracy did not improve from 0.52086
28/28 [==============================] - 2s 74ms/step - loss: 0.2583 - accuracy: 0.8675 - val_loss: 2.3705 - val_accuracy: 0.5020
Prediction and results¶
[207]:
prediction_and_metrics(model, compact_test_tweets_add_feat_data_padded, test_users_data_Y)
Accuracy: 0.49726027397260275
Precision: [0.49725275 0.49726776]
Recall: 0.4986301369863014
F1 score: 0.497948
ROC AUC: 0.497260
[ ]:
Model 5. (only additional features of tweets)¶
Create model¶
[208]:
def create_model_5(add_tweets_feat_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
    # Additional tweet's features input
    additional_tweet_input = Input(shape=add_tweets_feat_shape)
    masked_input = Masking(mask_value=0.0)(additional_tweet_input)
    lstm1 = LSTM(64, return_sequences=True)(masked_input)
    lstm2 = LSTM(64)(lstm1)
    output_layer = Dense(1, activation='sigmoid')(lstm2)
    model = keras.Model(inputs=additional_tweet_input, outputs=output_layer)
    model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
    return model
batch_size=250, epochs=400¶
Create and train model¶
[209]:
model_name = 'model_tweets_data_based_10000_5_v1_batch_size_250_20_latest_tweets_of_user_padded_add_feat_only'
model = create_model_5(add_tweets_feat_shape=np.array(compact_train_tweets_add_feat_data_padded[0]).shape,
                       optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
[210]:
model = train_model(model, model_name,
                    train_X=compact_train_tweets_add_feat_data_padded, train_Y=train_users_data_Y,
                    val_X=compact_val_tweets_add_feat_data_padded, val_Y=val_users_data_Y,
                    batch_size=250, epochs=400, patience=100)
accuracy
    training   (min: 0.500, max: 0.900, cur: 0.897)
    validation (min: 0.469, max: 0.517, cur: 0.489)
Loss
    training   (min: 0.185, max: 0.694, cur: 0.192)
    validation (min: 0.693, max: 2.429, cur: 2.429)
Epoch 101: val_accuracy did not improve from 0.51750
28/28 [==============================] - 3s 110ms/step - loss: 0.1921 - accuracy: 0.8968 - val_loss: 2.4288 - val_accuracy: 0.4892
Prediction and results¶
[211]:
prediction_and_metrics(model, compact_test_tweets_add_feat_data_padded, test_users_data_Y)
Accuracy: 0.46986301369863015
Precision: [0.46167247 0.4751693 ]
Recall: 0.5767123287671233
F1 score: 0.521040
ROC AUC: 0.469863
Model 6. (only additional features of tweets)¶
Create model¶
[212]:
def create_model_6(add_tweets_feat_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
    # Additional tweet's features input
    additional_tweet_input = Input(shape=add_tweets_feat_shape)
    masked_input = Masking(mask_value=0.0)(additional_tweet_input)
    lstm1 = LSTM(64, return_sequences=True, recurrent_dropout=0.2)(masked_input)
    lstm2 = LSTM(64, recurrent_dropout=0.1)(lstm1)
    output_layer = Dense(1, activation='sigmoid')(lstm2)
    model = keras.Model(inputs=additional_tweet_input, outputs=output_layer)
    model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
    return model
batch_size=250, epochs=400¶
Create and train model¶
[213]:
model_name = 'model_tweets_data_based_10000_6_v1_batch_size_250_20_latest_tweets_of_user_padded_add_feat_only'
model = create_model_6(add_tweets_feat_shape=np.array(compact_train_tweets_add_feat_data_padded[0]).shape,
                       optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
[214]:
model = train_model(model, model_name,
                    train_X=compact_train_tweets_add_feat_data_padded, train_Y=train_users_data_Y,
                    val_X=compact_val_tweets_add_feat_data_padded, val_Y=val_users_data_Y,
                    batch_size=250, epochs=400, patience=100)
accuracy
    training   (min: 0.501, max: 0.926, cur: 0.926)
    validation (min: 0.491, max: 0.522, cur: 0.499)
Loss
    training   (min: 0.145, max: 0.695, cur: 0.145)
    validation (min: 0.694, max: 2.625, cur: 2.625)
Epoch 219: val_accuracy did not improve from 0.52221
28/28 [==============================] - 3s 100ms/step - loss: 0.1450 - accuracy: 0.9260 - val_loss: 2.6252 - val_accuracy: 0.4993
Prediction and results¶
[215]:
prediction_and_metrics(model, compact_test_tweets_add_feat_data_padded, test_users_data_Y)
Accuracy: 0.5123287671232877
Precision: [0.51246537 0.51219512]
Recall: 0.5178082191780822
F1 score: 0.514986
ROC AUC: 0.512329
Model 7. (only additional features of tweets)¶
Create model¶
[221]:
def create_model_7(add_tweets_feat_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
    # Additional tweet's features input
    additional_tweet_input = Input(shape=add_tweets_feat_shape)
    masked_input = Masking(mask_value=0.0)(additional_tweet_input)
    lstm1 = LSTM(64, return_sequences=True, activation="relu")(masked_input)
    lstm2 = LSTM(64, activation="relu")(lstm1)
    output_layer = Dense(1, activation='sigmoid')(lstm2)
    model = keras.Model(inputs=additional_tweet_input, outputs=output_layer)
    model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
    return model
batch_size=250, epochs=400¶
Create and train model¶
[217]:
model_name = 'model_tweets_data_based_10000_7_v1_batch_size_250_20_latest_tweets_of_user_padded_add_feat_only'
model = create_model_7(add_tweets_feat_shape=np.array(compact_train_tweets_add_feat_data_padded[0]).shape,
                       optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
[218]:
model = train_model(model, model_name,
                    train_X=compact_train_tweets_add_feat_data_padded, train_Y=train_users_data_Y,
                    val_X=compact_val_tweets_add_feat_data_padded, val_Y=val_users_data_Y,
                    batch_size=250, epochs=400, patience=100)
accuracy
    training   (min: 0.497, max: 0.914, cur: 0.911)
    validation (min: 0.478, max: 0.522, cur: 0.501)
Loss
    training   (min: 0.149, max: 0.702, cur: 0.165)
    validation (min: 0.693, max: 4.006, cur: 3.745)
Epoch 173: val_accuracy did not improve from 0.52221
28/28 [==============================] - 3s 116ms/step - loss: 0.1645 - accuracy: 0.9109 - val_loss: 3.7446 - val_accuracy: 0.5007
Prediction and results¶
[219]:
prediction_and_metrics(model, compact_test_tweets_add_feat_data_padded, test_users_data_Y)
Accuracy: 0.5061643835616438
Precision: [0.50576184 0.50662739]
Recall: 0.4712328767123288
F1 score: 0.488290
ROC AUC: 0.506164
[222]:
prediction_and_metrics(model, compact_train_tweets_add_feat_data_padded, train_users_data_Y)
Accuracy: 0.7333717247336596
Precision: [0.71527224 0.75479409]
Recall: 0.6913331413763317
F1 score: 0.721671
ROC AUC: 0.733372
Model 7a. (only additional features of tweets)¶
Create model¶
[175]:
def create_model_7a(add_tweets_feat_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
    # Additional tweet's features input
    additional_tweet_input = Input(shape=add_tweets_feat_shape)
    masked_input = Masking(mask_value=0.0)(additional_tweet_input)
    lstm1 = LSTM(64, return_sequences=True, activation="relu")(masked_input)
    lstm2 = LSTM(512, activation="relu")(lstm1)
    output_layer = Dense(1, activation='sigmoid')(lstm2)
    model = keras.Model(inputs=additional_tweet_input, outputs=output_layer)
    model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
    return model
2023-09-03 16:23:42.684511: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda/lib:/usr/local/lib/x86_64-linux-gnu:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-09-03 16:23:42.684567: W tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:265] failed call to cuInit: UNKNOWN ERROR (303)
2023-09-03 16:23:42.684610: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (b0f306797141): /proc/driver/nvidia/version does not exist
2023-09-03 16:23:42.686593: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
batch_size=250, epochs=400¶
Create and train model¶
[176]:
model_name = 'model_tweets_data_based_10000_7a_v1_batch_size_250_20_latest_tweets_of_user_padded_add_feat_only'
model = create_model_7a(add_tweets_feat_shape=np.array(compact_train_tweets_add_feat_data_padded[0]).shape,
                        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
[ ]:
model = train_model(model, model_name,
                    train_X=compact_train_tweets_add_feat_data_padded, train_Y=train_users_data_Y,
                    val_X=compact_val_tweets_add_feat_data_padded, val_Y=val_users_data_Y,
                    batch_size=250, epochs=400, patience=100)
accuracy
    training   (min: 0.497, max: 0.966, cur: 0.960)
    validation (min: 0.483, max: 0.529, cur: 0.513)
Loss
    training   (min: 0.064, max: 0.715, cur: 0.076)
    validation (min: 0.693, max: 4.596, cur: 3.335)
Epoch 299: val_accuracy did not improve from 0.52894
28/28 [==============================] - 11s 400ms/step - loss: 0.0759 - accuracy: 0.9601 - val_loss: 3.3347 - val_accuracy: 0.5135
Epoch 300/400
 7/28 [======>.......................] - ETA: 8s - loss: 0.0702 - accuracy: 0.9680
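`train_model` is defined elsewhere in the notebook; judging by the log ("Epoch 299: val_accuracy did not improve from 0.52894" with `patience=100`), it appears to monitor `val_accuracy` with an early-stopping patience. A plain-Python sketch of that patience rule (function name and details are hypothetical, not the notebook's actual callback):

```python
def early_stop_epochs(val_accuracies, patience):
    """Return the 1-based epoch at which training would stop:
    after `patience` consecutive epochs without a new best val_accuracy."""
    best, since_best = float("-inf"), 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best:
            best, since_best = acc, 0
        else:
            since_best += 1
            if since_best >= patience:
                return epoch
    return len(val_accuracies)

# A best value early on, followed by no improvement, stops `patience` epochs later.
print(early_stop_epochs([0.50, 0.53, 0.51, 0.52, 0.51, 0.50], patience=3))  # 5
```

With `patience=100`, runs like this one can keep training long after the best validation epoch — the widening train/validation gap above is the overfitting that early stopping eventually cuts off.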
Prediction and results
[ ]:
prediction_and_metrics(model, compact_test_tweets_add_feat_data_padded, test_users_data_Y)
[ ]:
prediction_and_metrics(model, compact_train_tweets_add_feat_data_padded, train_users_data_Y)

Model 7b. (only additional features of tweets)
Create model
[ ]:
def create_model_7b(add_tweets_feat_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
    # Additional tweet's features input
    additional_tweet_input = Input(shape=add_tweets_feat_shape)
    masked_input = Masking(mask_value=0.0)(additional_tweet_input)
    lstm1 = LSTM(64, return_sequences=True, activation="relu",
                 kernel_regularizer=tf.keras.regularizers.l1(l=0.01))(masked_input)
    lstm2 = LSTM(128, activation="relu",
                 kernel_regularizer=tf.keras.regularizers.l1(l=0.01))(lstm1)
    output_layer = Dense(1, activation='sigmoid')(lstm2)
    model = keras.Model(inputs=additional_tweet_input, outputs=output_layer)
    model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
    return model

batch_size=250, epochs=400
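Model 7b differs from 7a by adding L1 weight regularization to both LSTM kernels. The penalty `tf.keras.regularizers.l1(l=0.01)` contributes is simply `l * sum(|w|)` over the regularized kernel, added to the training loss. A pure-Python sketch (weights here are made up for illustration):

```python
def l1_penalty(weights, l=0.01):
    # tf.keras.regularizers.l1(l) adds l * sum(|w|) over the kernel to the loss
    return l * sum(abs(w) for w in weights)

print(l1_penalty([0.5, -2.0, 0.0, 1.5]))  # 0.04
```

Because the penalty grows with the absolute size of every weight, it pushes weights toward exactly zero, which is one way to fight the overfitting seen in model 7a.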
[ ]:
model_name = 'model_tweets_data_based_10000_7b_v1_batch_size_250_20_latest_tweets_of_user_padded_add_feat_only'
model = create_model_7b(add_tweets_feat_shape=np.array(compact_train_tweets_add_feat_data_padded[0]).shape,
                        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
[ ]:
model = train_model(model, model_name,
                    train_X=compact_train_tweets_add_feat_data_padded, train_Y=train_users_data_Y,
                    val_X=compact_val_tweets_add_feat_data_padded, val_Y=val_users_data_Y,
                    batch_size=250, epochs=400, patience=100)

Prediction and results
[ ]:
prediction_and_metrics(model, compact_test_tweets_add_feat_data_padded, test_users_data_Y)

Prediction and results of training set
[ ]:
prediction_and_metrics(model, compact_train_tweets_add_feat_data_padded, train_users_data_Y)

Model 8. (only additional features of tweets)
Create model
[223]:
def create_model_8(add_tweets_feat_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
    # Additional tweet's features input
    additional_tweet_input = Input(shape=add_tweets_feat_shape)
    masked_input = Masking(mask_value=0.0)(additional_tweet_input)
    lstm1 = LSTM(16, return_sequences=True)(masked_input)
    lstm2 = LSTM(16)(lstm1)
    output_layer = Dense(1, activation='sigmoid')(lstm2)
    model = keras.Model(inputs=additional_tweet_input, outputs=output_layer)
    model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
    return model

batch_size=250, epochs=400
Create and train model
[224]:
model_name = 'model_tweets_data_based_10000_8_v1_batch_size_250_20_latest_tweets_of_user_padded_add_feat_only'
model = create_model_8(add_tweets_feat_shape=np.array(compact_train_tweets_add_feat_data_padded[0]).shape,
                       optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
[225]:
model = train_model(model, model_name,
                    train_X=compact_train_tweets_add_feat_data_padded, train_Y=train_users_data_Y,
                    val_X=compact_val_tweets_add_feat_data_padded, val_Y=val_users_data_Y,
                    batch_size=250, epochs=400, patience=100)
accuracy
    training   (min: 0.491, max: 0.682, cur: 0.678)
    validation (min: 0.484, max: 0.518, cur: 0.497)
Loss
    training   (min: 0.563, max: 0.694, cur: 0.563)
    validation (min: 0.693, max: 0.878, cur: 0.877)
Epoch 109: val_accuracy did not improve from 0.51817
28/28 [==============================] - 2s 59ms/step - loss: 0.5630 - accuracy: 0.6784 - val_loss: 0.8774 - val_accuracy: 0.4973
Prediction and results
[226]:
prediction_and_metrics(model, compact_test_tweets_add_feat_data_padded, test_users_data_Y)
Accuracy:  0.4986301369863014
Precision: [0.4987715  0.49845201]
Recall:    0.4410958904109589
F1 score:  0.468023
ROC AUC:   0.498630
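The printed F1 score can be reproduced from the positive-class precision and recall above — F1 is their harmonic mean:

```python
def f1_score(precision, recall):
    # F1 is the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# Positive-class precision and recall as printed by prediction_and_metrics
print(f1_score(0.49845201, 0.4410958904109589))  # ~0.468, matching the F1 above
```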
CNN
[363]:
from keras.layers import SimpleRNN

Model 9. (only text of tweets)
Create model
[186]:
def create_model_9(add_tweets_feat_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
    # Additional tweet's features input
    additional_tweet_input = Input(shape=add_tweets_feat_shape)
    masked_input = Masking(mask_value=0.0)(additional_tweet_input)
    cnn_layer1 = Conv1D(filters=64, kernel_size=3, activation='relu')(masked_input)
    # features_cnn_layer = MaxPooling1D()(features_cnn_layer)
    flatten_layer1 = Flatten()(cnn_layer1)
    output_layer = Dense(1, activation='sigmoid')(flatten_layer1)
    model = keras.Model(inputs=additional_tweet_input, outputs=output_layer)
    model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
    return model

batch_size=250, epochs=400
Create and train model
[187]:
model_name = 'model_tweets_data_based_10000_9_v1_batch_size_250_20_latest_tweets_of_user_padded_tweet_text_only'
model = create_model_9(add_tweets_feat_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                       optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
[188]:
model = train_model(model, model_name,
                    train_X=compact_train_tweets_text_data_padded, train_Y=train_users_data_Y,
                    val_X=compact_val_tweets_text_data_padded, val_Y=val_users_data_Y,
                    batch_size=250, epochs=400, patience=100)
accuracy
    training   (min: 0.503, max: 0.871, cur: 0.843)
    validation (min: 0.480, max: 0.520, cur: 0.499)
Loss
    training   (min: 26.574, max: 5978.037, cur: 46.327)
    validation (min: 623.592, max: 2703.696, cur: 789.925)
Epoch 202: val_accuracy did not improve from 0.52019
28/28 [==============================] - 1s 32ms/step - loss: 46.3273 - accuracy: 0.8431 - val_loss: 789.9250 - val_accuracy: 0.4987
Prediction and results
[189]:
prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)
Accuracy:  0.5095890410958904
Precision: [0.51100629 0.50849515]
Recall:    0.5739726027397261
F1 score:  0.539254
ROC AUC:   0.509589
Model 10. (only text of tweets)
Create model
[194]:
def create_model_10(add_tweets_feat_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
    # Additional tweet's features input
    additional_tweet_input = Input(shape=add_tweets_feat_shape)
    masked_input = Masking(mask_value=0.0)(additional_tweet_input)
    cnn_layer1 = Conv1D(filters=64, kernel_size=3, activation='relu')(masked_input)
    pooling_layer1 = MaxPooling1D(pool_size=2, strides=2, padding='valid')(cnn_layer1)
    flatten_layer1 = Flatten()(pooling_layer1)
    dropout_layer1 = Dropout(0.2)(flatten_layer1)
    output_layer = Dense(1, activation='sigmoid')(dropout_layer1)
    model = keras.Model(inputs=additional_tweet_input, outputs=output_layer)
    model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
    return model

batch_size=250, epochs=400
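With `padding='valid'`, the Conv1D and MaxPooling1D output lengths follow directly from the kernel/pool sizes. A quick check of the lengths this model produces for the 20-tweet sequences (the formulas are the standard ones; the sizes are taken from the layers above):

```python
def conv1d_out_len(n, kernel_size, stride=1):
    # Conv1D with padding='valid'
    return (n - kernel_size) // stride + 1

def maxpool1d_out_len(n, pool_size, stride):
    # MaxPooling1D with padding='valid'
    return (n - pool_size) // stride + 1

after_conv = conv1d_out_len(20, 3)                # Conv1D(kernel_size=3)
after_pool = maxpool1d_out_len(after_conv, 2, 2)  # MaxPooling1D(pool_size=2, strides=2)
print(after_conv, after_pool)  # 18 9
```

So Flatten sees 9 timesteps x 64 filters = 576 values per user before the sigmoid head.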
Create and train model
[195]:
model_name = 'model_tweets_data_based_10000_10_v1_batch_size_250_20_latest_tweets_of_user_padded_tweet_text_only'
model = create_model_10(add_tweets_feat_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
[196]:
model = train_model(model, model_name,
                    train_X=compact_train_tweets_text_data_padded, train_Y=train_users_data_Y,
                    val_X=compact_val_tweets_text_data_padded, val_Y=val_users_data_Y,
                    batch_size=250, epochs=400, patience=100)
accuracy
    training   (min: 0.488, max: 0.518, cur: 0.502)
    validation (min: 0.483, max: 0.518, cur: 0.499)
Loss
    training   (min: 0.797, max: 8489.423, cur: 0.964)
    validation (min: 0.692, max: 1467.734, cur: 0.693)
Epoch 107: val_accuracy did not improve from 0.51817
28/28 [==============================] - 1s 32ms/step - loss: 0.9644 - accuracy: 0.5022 - val_loss: 0.6929 - val_accuracy: 0.4993
Prediction and results
[197]:
prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)
Accuracy:  0.5068493150684932
Precision: [0.50595238 0.50806452]
Recall:    0.4315068493150685
F1 score:  0.466667
ROC AUC:   0.506849
Prediction and results on training set
[201]:
prediction_and_metrics(model, compact_train_tweets_text_data_padded, train_users_data_Y)
Accuracy:  0.506334581053844
Precision: [0.50554435 0.50738751]
Recall:    0.43507054419809965
F1 score:  0.468455
ROC AUC:   0.506335
Model 11. (only text of tweets)
Create model
[216]:
def create_model_11(add_tweets_feat_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
    # Additional tweet's features input
    additional_tweet_input = Input(shape=add_tweets_feat_shape)
    masked_input = Masking(mask_value=0.0)(additional_tweet_input)
    cnn_layer1 = Conv1D(filters=64, kernel_size=3, activation='relu')(masked_input)
    pooling_layer1 = MaxPooling1D(pool_size=4, strides=2, padding='valid')(cnn_layer1)
    flatten_layer1 = Flatten()(pooling_layer1)
    dropout_layer1 = Dropout(0.2)(flatten_layer1)
    output_layer = Dense(1, activation='sigmoid')(dropout_layer1)
    model = keras.Model(inputs=additional_tweet_input, outputs=output_layer)
    model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
    return model

batch_size=250, epochs=400
Create and train model
[217]:
model_name = 'model_tweets_data_based_10000_11_v1_batch_size_250_20_latest_tweets_of_user_padded_tweet_text_only'
model = create_model_11(add_tweets_feat_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
[218]:
model = train_model(model, model_name,
                    train_X=compact_train_tweets_text_data_padded, train_Y=train_users_data_Y,
                    val_X=compact_val_tweets_text_data_padded, val_Y=val_users_data_Y,
                    batch_size=250, epochs=400, patience=100)
accuracy
    training   (min: 0.508, max: 0.579, cur: 0.533)
    validation (min: 0.475, max: 0.528, cur: 0.503)
Loss
    training   (min: 2.630, max: 11262.217, cur: 2.655)
    validation (min: 4.687, max: 4742.995, cur: 4.936)
Epoch 134: val_accuracy did not improve from 0.52826
28/28 [==============================] - 1s 33ms/step - loss: 2.6553 - accuracy: 0.5334 - val_loss: 4.9364 - val_accuracy: 0.5034
Prediction and results
[219]:
prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)
Accuracy:  0.49794520547945204
Precision: [0.49677419 0.49849246]
Recall:    0.6794520547945205
F1 score:  0.575072
ROC AUC:   0.497945
Prediction and results on training set
[220]:
prediction_and_metrics(model, compact_train_tweets_text_data_padded, train_users_data_Y)
Accuracy:  0.5813417794414051
Precision: [0.63012437 0.5591623 ]
Recall:    0.7687877915346962
F1 score:  0.647430
ROC AUC:   0.581342
Model 12. (only text of tweets)
Create model
[224]:
def create_model_12(add_tweets_feat_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
    # Additional tweet's features input
    additional_tweet_input = Input(shape=add_tweets_feat_shape)
    masked_input = Masking(mask_value=0.0)(additional_tweet_input)
    cnn_layer1 = Conv1D(filters=16, kernel_size=3, activation='relu')(masked_input)
    pooling_layer1 = MaxPooling1D(pool_size=2, strides=1, padding='valid')(cnn_layer1)
    flatten_layer1 = Flatten()(pooling_layer1)
    dropout_layer1 = Dropout(0.2)(flatten_layer1)
    output_layer = Dense(1, activation='sigmoid')(dropout_layer1)
    model = keras.Model(inputs=additional_tweet_input, outputs=output_layer)
    model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
    return model

batch_size=250, epochs=400
Create and train model
[225]:
model_name = 'model_tweets_data_based_10000_12_v1_batch_size_250_20_latest_tweets_of_user_padded_tweet_text_only'
model = create_model_12(add_tweets_feat_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
[226]:
model = train_model(model, model_name,
                    train_X=compact_train_tweets_text_data_padded, train_Y=train_users_data_Y,
                    val_X=compact_val_tweets_text_data_padded, val_Y=val_users_data_Y,
                    batch_size=250, epochs=400, patience=100)
accuracy
    training   (min: 0.498, max: 0.533, cur: 0.512)
    validation (min: 0.483, max: 0.533, cur: 0.508)
Loss
    training   (min: 1.001, max: 17936.178, cur: 1.036)
    validation (min: 1.909, max: 7260.793, cur: 1.944)
Epoch 176: val_accuracy did not improve from 0.53297
28/28 [==============================] - 1s 32ms/step - loss: 1.0360 - accuracy: 0.5117 - val_loss: 1.9438 - val_accuracy: 0.5081
Prediction and results
[227]:
prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)
Accuracy:  0.4917808219178082
Precision: [0.49393939 0.48723404]
Recall:    0.3136986301369863
F1 score:  0.381667
ROC AUC:   0.491781
Prediction and results on training set
[228]:
prediction_and_metrics(model, compact_train_tweets_text_data_padded, train_users_data_Y)
Accuracy:  0.5200115174200979
Precision: [0.5146285  0.53166287]
Recall:    0.3360207313561762
F1 score:  0.411785
ROC AUC:   0.520012
Model 13. (only text of tweets)
Create model
[238]:
def create_model_13(add_tweets_feat_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
    # Additional tweet's features input
    additional_tweet_input = Input(shape=add_tweets_feat_shape)
    print(additional_tweet_input.shape)
    masked_input = Masking(mask_value=0.0)(additional_tweet_input)
    print(masked_input.shape)
    cnn_layer1 = Conv1D(filters=15, kernel_size=3, activation='relu')(masked_input)
    print(cnn_layer1.shape)
    # pooling_layer1 = MaxPooling1D(pool_size=4, strides=1, padding='valid')(cnn_layer1)
    flatten_layer1 = Flatten()(cnn_layer1)
    print(flatten_layer1.shape)
    dropout_layer1 = Dropout(0.2)(flatten_layer1)
    output_layer = Dense(1, activation='sigmoid')(dropout_layer1)
    model = keras.Model(inputs=additional_tweet_input, outputs=output_layer)
    model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
    return model

batch_size=250, epochs=400
Create and train model
[239]:
model_name = 'model_tweets_data_based_10000_13_v1_batch_size_250_20_latest_tweets_of_user_padded_tweet_text_only'
model = create_model_13(add_tweets_feat_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
(None, 20, 15)
(None, 20, 15)
(None, 18, 15)
(None, 270)
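The shapes printed by create_model_13 can be checked by hand: `Conv1D(kernel_size=3, padding='valid')` shortens the 20 timesteps to 18, and Flatten multiplies by the 15 filters:

```python
def conv1d_out_len(n, kernel_size):
    # Conv1D with padding='valid', stride 1
    return n - kernel_size + 1

timesteps, filters, kernel_size = 20, 15, 3
out_steps = conv1d_out_len(timesteps, kernel_size)
print((out_steps, filters), out_steps * filters)  # (18, 15) 270
```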
[240]:
model = train_model(model, model_name,
                    train_X=compact_train_tweets_text_data_padded, train_Y=train_users_data_Y,
                    val_X=compact_val_tweets_text_data_padded, val_Y=val_users_data_Y,
                    batch_size=250, epochs=400, patience=100)
accuracy
    training   (min: 0.485, max: 0.541, cur: 0.506)
    validation (min: 0.471, max: 0.535, cur: 0.495)
Loss
    training   (min: 2.628, max: 12445.366, cur: 2.777)
    validation (min: 2.351, max: 6456.548, cur: 2.813)
Epoch 139: val_accuracy did not improve from 0.53499
28/28 [==============================] - 1s 30ms/step - loss: 2.7770 - accuracy: 0.5056 - val_loss: 2.8129 - val_accuracy: 0.4946
Prediction and results
[241]:
prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)
Accuracy:  0.5116438356164383
Precision: [0.51185495 0.51144011]
Recall:    0.5205479452054794
F1 score:  0.515954
ROC AUC:   0.511644
Prediction and results on training set
[242]:
prediction_and_metrics(model, compact_train_tweets_text_data_padded, train_users_data_Y)
Accuracy:  0.5192916786639793
Precision: [0.51918671 0.5193978 ]
Recall:    0.5165562913907285
F1 score:  0.517973
ROC AUC:   0.519292
Model 1. (only text of tweets)
Create model
[275]:
def create_model_1(num_words, embedding_dim, embedding_matrix, max_sequence_length, trainable, tweets_text_shape,
                   optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
    # Tweets text input
    text_input = Input(shape=tweets_text_shape)
    # Embedding layer for text
    embedding_layer = Embedding(num_words, embedding_dim, weights=[embedding_matrix],
                                input_length=max_sequence_length, trainable=trainable)(text_input)
    masked_input = Masking(mask_value=0.0)(embedding_layer)
    # Reshape layer to flatten only the last two dimensions
    reshape = Reshape((tweets_text_shape[0], tweets_text_shape[1]*embedding_dim))(masked_input)
    # lstm1 = LSTM(64, return_sequences=True)(masked_input)
    lstm2 = LSTM(64)(reshape)
    # dropout = Dropout(0.5)(lstm2)
    output_layer = Dense(1, activation='sigmoid')(lstm2)
    model = keras.Model(inputs=text_input, outputs=output_layer)
    model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
    return model

batch_size=50, epochs=400
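The Reshape in create_model_1 collapses each tweet's token embeddings into a single vector, so the LSTM steps over tweets rather than tokens. A shape-only sketch (the 30-token / 100-dimension sizes here are illustrative, not the notebook's actual `max_length` / `embedding_dim`):

```python
# Hypothetical sizes: 20 tweets per user, 30 tokens per tweet, 100-d GloVe vectors
tweets_per_user, max_length, embedding_dim = 20, 30, 100

# Embedding output per user: (tweets, tokens, embedding_dim)
embedded_shape = (tweets_per_user, max_length, embedding_dim)

# Reshape((tweets, tokens * embedding_dim)): one flat vector per tweet
reshaped_shape = (embedded_shape[0], embedded_shape[1] * embedded_shape[2])
print(reshaped_shape)  # (20, 3000)
```

One caveat worth noting: the reshape discards the token-level time axis, so the LSTM cannot exploit word order within a tweet, only order across the 20 tweets.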
Create and train model
[276]:
model_name = 'model_tweets_data_based_10000_1_v2_batch_size_50_20_latest_tweets_of_user_padded_tweets_text_only'
model = create_model_1(num_words=num_words, embedding_dim=embedding_dim, embedding_matrix=embedding_matrix,
                       max_sequence_length=max_length, trainable=False,
                       tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                       optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
[277]:
model = train_model(model, model_name,
                    train_X=compact_train_tweets_text_data_padded, train_Y=train_users_data_Y,
                    val_X=compact_val_tweets_text_data_padded, val_Y=val_users_data_Y,
                    batch_size=50, epochs=400, patience=100)
accuracy
    training   (min: 0.499, max: 1.000, cur: 0.998)
    validation (min: 0.491, max: 0.532, cur: 0.515)
Loss
    training   (min: 0.001, max: 0.703, cur: 0.005)
    validation (min: 0.693, max: 4.547, cur: 3.618)
Epoch 273: val_accuracy did not improve from 0.53163
139/139 [==============================] - 6s 40ms/step - loss: 0.0051 - accuracy: 0.9983 - val_loss: 3.6175 - val_accuracy: 0.5155
Prediction and results
[278]:
prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)
Accuracy:  0.5082191780821917
Precision: [0.50826446 0.50817439]
Recall:    0.510958904109589
F1 score:  0.509563
ROC AUC:   0.508219
batch_size=100, epochs=400
Create and train model
[280]:
model_name = 'model_tweets_data_based_10000_1_v1_batch_size_100_20_latest_tweets_of_user_padded_tweets_text_only'
model = create_model_1(num_words=num_words, embedding_dim=embedding_dim, embedding_matrix=embedding_matrix,
                       max_sequence_length=max_length, trainable=False,
                       tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                       optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
[281]:
model = train_model(model, model_name,
                    train_X=compact_train_tweets_text_data_padded, train_Y=train_users_data_Y,
                    val_X=compact_val_tweets_text_data_padded, val_Y=val_users_data_Y,
                    batch_size=100, epochs=400, patience=100)
accuracy
    training   (min: 0.506, max: 1.000, cur: 0.999)
    validation (min: 0.490, max: 0.526, cur: 0.522)
Loss
    training   (min: 0.001, max: 0.703, cur: 0.002)
    validation (min: 0.696, max: 4.314, cur: 3.750)
Epoch 108: val_accuracy did not improve from 0.52624
70/70 [==============================] - 4s 56ms/step - loss: 0.0016 - accuracy: 0.9993 - val_loss: 3.7500 - val_accuracy: 0.5215
Prediction and results
[282]:
prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)
Accuracy:  0.49726027397260275
Precision: [0.49784946 0.49622642]
Recall:    0.36027397260273974
F1 score:  0.417460
ROC AUC:   0.497260
batch_size=250, epochs=400
Create and train model
[287]:
model_name = 'model_tweets_data_based_10000_1_v1_batch_size_250_20_latest_tweets_of_user_padded_tweets_text_only'
model = create_model_1(num_words=num_words, embedding_dim=embedding_dim, embedding_matrix=embedding_matrix,
                       max_sequence_length=max_length, trainable=False,
                       tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                       optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
[288]:
model = train_model(model, model_name,
                    train_X=compact_train_tweets_text_data_padded, train_Y=train_users_data_Y,
                    val_X=compact_val_tweets_text_data_padded, val_Y=val_users_data_Y,
                    batch_size=250, epochs=400, patience=100)
accuracy
    training   (min: 0.507, max: 1.000, cur: 0.999)
    validation (min: 0.476, max: 0.521, cur: 0.485)
Loss
    training   (min: 0.001, max: 0.707, cur: 0.002)
    validation (min: 0.696, max: 3.933, cur: 3.933)
Epoch 126: val_accuracy did not improve from 0.52086
28/28 [==============================] - 3s 114ms/step - loss: 0.0019 - accuracy: 0.9993 - val_loss: 3.9332 - val_accuracy: 0.4852
Prediction and results
[289]:
prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)
Accuracy:  0.5095890410958904
Precision: [0.50902062 0.51023392]
Recall:    0.4780821917808219
F1 score:  0.493635
ROC AUC:   0.509589
Prediction on training subset
[290]:
prediction_and_metrics(model, compact_train_tweets_text_data_padded, train_users_data_Y)
Accuracy:  0.9985603224877627
Precision: [0.99827338 0.99884759]
Recall:    0.9982723869853153
F1 score:  0.998560
ROC AUC:   0.998560
batch_size=500, epochs=400
Create and train model
[291]:
model_name = 'model_tweets_data_based_10000_1_v1_batch_size_500_20_latest_tweets_of_user_padded_tweets_text_only'
model = create_model_1(num_words=num_words, embedding_dim=embedding_dim, embedding_matrix=embedding_matrix,
                       max_sequence_length=max_length, trainable=False,
                       tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                       optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
[292]:
model = train_model(model, model_name,
                    train_X=compact_train_tweets_text_data_padded, train_Y=train_users_data_Y,
                    val_X=compact_val_tweets_text_data_padded, val_Y=val_users_data_Y,
                    batch_size=500, epochs=400, patience=100)
accuracy
    training   (min: 0.498, max: 1.000, cur: 0.999)
    validation (min: 0.479, max: 0.517, cur: 0.509)
Loss
    training   (min: 0.001, max: 0.706, cur: 0.001)
    validation (min: 0.693, max: 3.862, cur: 3.835)
Epoch 101: val_accuracy did not improve from 0.51750
14/14 [==============================] - 3s 200ms/step - loss: 0.0014 - accuracy: 0.9994 - val_loss: 3.8347 - val_accuracy: 0.5094
Prediction and results
[293]:
prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)
Accuracy:  0.5089041095890411
Precision: [0.50821745 0.50971599]
Recall:    0.4671232876712329
F1 score:  0.487491
ROC AUC:   0.508904
Model 2. (only text of tweets)
Create model
[299]:
def create_model_2(num_words, embedding_dim, embedding_matrix, max_sequence_length, trainable, tweets_text_shape,
                   optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
    # Tweets text input
    text_input = Input(shape=tweets_text_shape)
    # Embedding layer for text
    embedding_layer = Embedding(num_words, embedding_dim, weights=[embedding_matrix],
                                input_length=max_sequence_length, trainable=trainable)(text_input)
    masked_input = Masking(mask_value=0.0)(embedding_layer)
    # Reshape layer to flatten only the last two dimensions
    reshape = Reshape((tweets_text_shape[0], tweets_text_shape[1]*embedding_dim))(masked_input)
    lstm1 = LSTM(128, return_sequences=True)(reshape)
    lstm1_dropout1 = Dropout(0.2)(lstm1)
    lstm2 = LSTM(64)(lstm1_dropout1)
    dropout = Dropout(0.2)(lstm2)
    output_layer = Dense(1, activation='sigmoid')(dropout)
    model = keras.Model(inputs=text_input, outputs=output_layer)
    model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
    return model

batch_size=250, epochs=400
Create and train model
[300]:
model_name = 'model_tweets_data_based_10000_2_v1_batch_size_250_20_latest_tweets_of_user_padded_tweets_text_only'
model = create_model_2(num_words=num_words, embedding_dim=embedding_dim, embedding_matrix=embedding_matrix,
                       max_sequence_length=max_length, trainable=False,
                       tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                       optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
[301]:
model = train_model(model, model_name,
                    train_X=compact_train_tweets_text_data_padded, train_Y=train_users_data_Y,
                    val_X=compact_val_tweets_text_data_padded, val_Y=val_users_data_Y,
                    batch_size=250, epochs=400, patience=100)
accuracy
    training   (min: 0.504, max: 1.000, cur: 0.998)
    validation (min: 0.484, max: 0.514, cur: 0.504)
Loss
    training   (min: 0.001, max: 0.700, cur: 0.008)
    validation (min: 0.697, max: 4.317, cur: 2.814)
Epoch 116: val_accuracy did not improve from 0.51413
28/28 [==============================] - 6s 227ms/step - loss: 0.0080 - accuracy: 0.9977 - val_loss: 2.8140 - val_accuracy: 0.5040
Prediction and results
[302]:
prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)
Accuracy:  0.5027397260273972
Precision: [0.503003   0.50251889]
Recall:    0.5465753424657535
F1 score:  0.523622
ROC AUC:   0.502740
Model 3. (only text of tweets)
Create model
[303]:
def create_model_3(num_words, embedding_dim, embedding_matrix, max_sequence_length, trainable, tweets_text_shape,
                   optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
    # Tweets text input
    text_input = Input(shape=tweets_text_shape)
    # Embedding layer for text
    embedding_layer = Embedding(num_words, embedding_dim, weights=[embedding_matrix],
                                input_length=max_sequence_length, trainable=trainable)(text_input)
    masked_input = Masking(mask_value=0.0)(embedding_layer)
    # Reshape layer to flatten only the last two dimensions
    reshape = Reshape((tweets_text_shape[0], tweets_text_shape[1]*embedding_dim))(masked_input)
    lstm1 = LSTM(128, return_sequences=True)(reshape)
    lstm1_dropout1 = Dropout(0.2)(lstm1)
    lstm2 = LSTM(64)(lstm1_dropout1)
    lstm2_dropout = Dropout(0.2)(lstm2)
    dense_layer1 = Dense(64)(lstm2_dropout)
    dense_layer1_activation_layer1 = Activation('relu')(dense_layer1)
    dense_layer1_dropout1 = Dropout(0.1)(dense_layer1_activation_layer1)
    output_layer = Dense(1, activation='sigmoid')(dense_layer1_dropout1)
    model = keras.Model(inputs=text_input, outputs=output_layer)
    model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
    return model

batch_size=250, epochs=400
Create and train model
[304]:
model_name = 'model_tweets_data_based_10000_3_v1_batch_size_250_20_latest_tweets_of_user_padded_tweets_text_only'
model = create_model_3(num_words=num_words, embedding_dim=embedding_dim, embedding_matrix=embedding_matrix,
                       max_sequence_length=max_length, trainable=False,
                       tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                       optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
[305]:
model = train_model(model, model_name,
                    train_X=compact_train_tweets_text_data_padded, train_Y=train_users_data_Y,
                    val_X=compact_val_tweets_text_data_padded, val_Y=val_users_data_Y,
                    batch_size=250, epochs=400, patience=100)
accuracy
    training   (min: 0.514, max: 1.000, cur: 0.999)
    validation (min: 0.480, max: 0.520, cur: 0.495)
Loss
    training   (min: 0.001, max: 0.694, cur: 0.001)
    validation (min: 0.695, max: 4.917, cur: 4.917)
Epoch 121: val_accuracy did not improve from 0.52019
28/28 [==============================] - 6s 215ms/step - loss: 9.6534e-04 - accuracy: 0.9994 - val_loss: 4.9168 - val_accuracy: 0.4946
Prediction and results
[306]:
prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)
Accuracy:  0.5164383561643836
Precision: [0.51612903 0.51675978]
Recall:    0.5068493150684932
F1 score:  0.511757
ROC AUC:   0.516438
xxxxxxxxxxModel 4. (only text of tweets)¶
xxxxxxxxxxCreate model¶
[307]:
def create_model_4(num_words, embedding_dim, embedding_matrix, max_sequence_length, trainable, tweets_text_shape, optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)): # Tweets text input text_input = Input(shape=tweets_text_shape) # Embedding layer for text embedding_layer = Embedding(num_words, embedding_dim, weights=[embedding_matrix], input_length=max_sequence_length, trainable=trainable)(text_input) masked_input = Masking(mask_value=0.0)(embedding_layer) # flatten = Flatten()(masked_input) # print(flatten.shape) # Reshape layer to flatten only the last two dimensions reshape = Reshape((tweets_text_shape[0], tweets_text_shape[1]*embedding_dim))(masked_input) lstm1 = LSTM(64, return_sequences=False)(reshape) lstm1_dropout1 = Dropout(0.2)(lstm1) # lstm2 = LSTM(64)(lstm1_dropout1) # lstm2_dropout = Dropout(0.2)(lstm2) dense_layer1 = Dense(64)(lstm1_dropout1) dense_layer1_activation_layer1 = Activation('relu')(dense_layer1) dense_layer1_dropout1 = Dropout(0.1)(dense_layer1_activation_layer1) output_layer = Dense(1, activation='sigmoid')(dense_layer1_dropout1) model = keras.Model(inputs=text_input, outputs=output_layer) model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy']) return modelxxxxxxxxxxbatch_size=250, epochs=300¶
Create and train model
[308]:
model_name = 'model_tweets_data_based_10000_4_v1_batch_size_250_20_latest_tweets_of_user_padded_tweets_text_only'
model = create_model_4(num_words=num_words, embedding_dim=embedding_dim,
                       embedding_matrix=embedding_matrix, max_sequence_length=max_length,
                       trainable=False,
                       tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                       optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
[309]:
model = train_model(model, model_name,
                    train_X=compact_train_tweets_text_data_padded, train_Y=train_users_data_Y,
                    val_X=compact_val_tweets_text_data_padded, val_Y=val_users_data_Y,
                    batch_size=250, epochs=400, patience=100)

accuracy
    training   (min: 0.517, max: 1.000, cur: 0.998)
    validation (min: 0.487, max: 0.526, cur: 0.499)
Loss
    training   (min: 0.001, max: 0.696, cur: 0.006)
    validation (min: 0.697, max: 4.915, cur: 3.429)
Epoch 238: val_accuracy did not improve from 0.52557
28/28 [==============================] - 4s 147ms/step - loss: 0.0059 - accuracy: 0.9983 - val_loss: 3.4287 - val_accuracy: 0.4987
Prediction and results
[310]:
prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)

Accuracy: 0.5054794520547945
Precision: [0.50600601 0.50503778]
Recall: 0.5493150684931507
F1 score: 0.526247
ROC AUC: 0.505479
Model 5. (only text of tweets)
Create model
[312]:
def create_model_5(num_words, embedding_dim, embedding_matrix, max_sequence_length,
                   trainable, tweets_text_shape,
                   optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
    # Tweets text input
    text_input = Input(shape=tweets_text_shape)

    # Embedding layer for text
    embedding_layer = Embedding(num_words, embedding_dim, weights=[embedding_matrix],
                                input_length=max_sequence_length, trainable=trainable)(text_input)
    masked_input = Masking(mask_value=0.0)(embedding_layer)

    # Reshape layer to flatten only the last two dimensions
    reshape = Reshape((tweets_text_shape[0], tweets_text_shape[1] * embedding_dim))(masked_input)

    lstm1 = LSTM(64, return_sequences=True)(reshape)
    lstm2 = LSTM(64)(lstm1)

    output_layer = Dense(1, activation='sigmoid')(lstm2)

    model = keras.Model(inputs=text_input, outputs=output_layer)
    model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
    return model

batch_size=250, epochs=300
Create and train model
[313]:
model_name = 'model_tweets_data_based_10000_5_v1_batch_size_250_20_latest_tweets_of_user_padded_tweets_text_only'
model = create_model_5(num_words=num_words, embedding_dim=embedding_dim,
                       embedding_matrix=embedding_matrix, max_sequence_length=max_length,
                       trainable=False,
                       tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                       optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
[314]:
model = train_model(model, model_name,
                    train_X=compact_train_tweets_text_data_padded, train_Y=train_users_data_Y,
                    val_X=compact_val_tweets_text_data_padded, val_Y=val_users_data_Y,
                    batch_size=250, epochs=400, patience=100)

accuracy
    training   (min: 0.506, max: 0.999, cur: 0.999)
    validation (min: 0.483, max: 0.532, cur: 0.503)
Loss
    training   (min: 0.001, max: 0.696, cur: 0.001)
    validation (min: 0.695, max: 4.273, cur: 4.272)
Epoch 121: val_accuracy did not improve from 0.53163
28/28 [==============================] - 5s 167ms/step - loss: 9.8464e-04 - accuracy: 0.9993 - val_loss: 4.2722 - val_accuracy: 0.5027
Prediction and results
[315]:
prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)

Accuracy: 0.4931506849315068
Precision: [0.49396135 0.49208861]
Recall: 0.426027397260274
F1 score: 0.456681
ROC AUC: 0.493151
Model 6. (only text of tweets)
Create model
[178]:
def create_model_6(num_words, embedding_dim, embedding_matrix, max_sequence_length,
                   trainable, tweets_text_shape,
                   optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
    # Tweets text input
    text_input = Input(shape=tweets_text_shape)

    # Embedding layer for text
    embedding_layer = Embedding(num_words, embedding_dim, weights=[embedding_matrix],
                                input_length=max_sequence_length, trainable=trainable)(text_input)
    masked_input = Masking(mask_value=0.0)(embedding_layer)

    # Reshape layer to flatten only the last two dimensions
    reshape = Reshape((tweets_text_shape[0], tweets_text_shape[1] * embedding_dim))(masked_input)

    lstm1 = LSTM(64, return_sequences=True, recurrent_dropout=0.2)(reshape)
    lstm2 = LSTM(64, recurrent_dropout=0.1)(lstm1)

    output_layer = Dense(1, activation='sigmoid')(lstm2)

    model = keras.Model(inputs=text_input, outputs=output_layer)
    model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
    return model

batch_size=250, epochs=400
Create and train model
[179]:
model_name = 'model_tweets_data_based_10000_6_v1_batch_size_250_20_latest_tweets_of_user_padded_tweets_text_only'
model = create_model_6(num_words=num_words, embedding_dim=embedding_dim,
                       embedding_matrix=embedding_matrix, max_sequence_length=max_length,
                       trainable=False,
                       tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                       optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
[180]:
model = train_model(model, model_name,
                    train_X=compact_train_tweets_text_data_padded, train_Y=train_users_data_Y,
                    val_X=compact_val_tweets_text_data_padded, val_Y=val_users_data_Y,
                    batch_size=250, epochs=400, patience=100)

accuracy
    training   (min: 0.504, max: 1.000, cur: 0.999)
    validation (min: 0.483, max: 0.524, cur: 0.487)
Loss
    training   (min: 0.001, max: 0.697, cur: 0.001)
    validation (min: 0.696, max: 4.327, cur: 4.326)
Epoch 163: val_accuracy did not improve from 0.52355
28/28 [==============================] - 4s 139ms/step - loss: 9.8478e-04 - accuracy: 0.9994 - val_loss: 4.3261 - val_accuracy: 0.4865
Prediction and results
[181]:
prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)

Accuracy: 0.49794520547945204
Precision: [0.49805951 0.49781659]
Recall: 0.4684931506849315
F1 score: 0.482710
ROC AUC: 0.497945
Model 7. (only text of tweets)
Create model
[182]:
def create_model_7(num_words, embedding_dim, embedding_matrix, max_sequence_length,
                   trainable, tweets_text_shape,
                   optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
    # Tweets text input
    text_input = Input(shape=tweets_text_shape)

    # Embedding layer for text
    embedding_layer = Embedding(num_words, embedding_dim, weights=[embedding_matrix],
                                input_length=max_sequence_length, trainable=trainable)(text_input)
    masked_input = Masking(mask_value=0.0)(embedding_layer)

    # Reshape layer to flatten only the last two dimensions
    reshape = Reshape((tweets_text_shape[0], tweets_text_shape[1] * embedding_dim))(masked_input)

    lstm1 = LSTM(64, return_sequences=True, activation="relu")(reshape)
    lstm2 = LSTM(64, activation="relu")(lstm1)

    output_layer = Dense(1, activation='sigmoid')(lstm2)

    model = keras.Model(inputs=text_input, outputs=output_layer)
    model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
    return model

batch_size=250, epochs=400
Create and train model
[183]:
model_name = 'model_tweets_data_based_10000_7_v1_batch_size_250_20_latest_tweets_of_user_padded_tweets_text_only'
model = create_model_7(num_words=num_words, embedding_dim=embedding_dim,
                       embedding_matrix=embedding_matrix, max_sequence_length=max_length,
                       trainable=False,
                       tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                       optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
[184]:
model = train_model(model, model_name,
                    train_X=compact_train_tweets_text_data_padded, train_Y=train_users_data_Y,
                    val_X=compact_val_tweets_text_data_padded, val_Y=val_users_data_Y,
                    batch_size=250, epochs=400, patience=100)

accuracy
    training   (min: 0.498, max: 0.973, cur: 0.923)
    validation (min: 0.485, max: 0.536, cur: 0.502)
Loss
    training   (min: 0.053, max: 2.874, cur: 0.217)
    validation (min: 0.694, max: 7.540, cur: 2.756)
Epoch 193: val_accuracy did not improve from 0.53567
28/28 [==============================] - 3s 121ms/step - loss: 0.2167 - accuracy: 0.9228 - val_loss: 2.7559 - val_accuracy: 0.5020
Prediction and results
[185]:
prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)

Accuracy: 0.5123287671232877
Precision: [0.51339286 0.51142132]
Recall: 0.552054794520548
F1 score: 0.530962
ROC AUC: 0.512329

[187]:
prediction_and_metrics(model, compact_train_tweets_text_data_padded, train_users_data_Y)

Accuracy: 0.637633170169882
Precision: [0.64826303 0.62842558]
Recall: 0.6734811402245897
F1 score: 0.650174
ROC AUC: 0.637633
Model 8. (only text of tweets)
Create model
[188]:
def create_model_8(num_words, embedding_dim, embedding_matrix, max_sequence_length,
                   trainable, tweets_text_shape,
                   optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
    # Tweets text input
    text_input = Input(shape=tweets_text_shape)

    # Embedding layer for text
    embedding_layer = Embedding(num_words, embedding_dim, weights=[embedding_matrix],
                                input_length=max_sequence_length, trainable=trainable)(text_input)
    masked_input = Masking(mask_value=0.0)(embedding_layer)

    # Reshape layer to flatten only the last two dimensions
    reshape = Reshape((tweets_text_shape[0], tweets_text_shape[1] * embedding_dim))(masked_input)

    lstm1 = LSTM(16, return_sequences=True)(reshape)
    lstm2 = LSTM(16)(lstm1)

    output_layer = Dense(1, activation='sigmoid')(lstm2)

    model = keras.Model(inputs=text_input, outputs=output_layer)
    model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
    return model

batch_size=250, epochs=300
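Model 8 shrinks the LSTM width from 64 to 16 units to fight overfitting. The first layer's parameter count drops roughly fourfold, since an LSTM layer's size is linear in the input dimension and quadratic in the units. A sketch using the standard Keras formula (the 1500-dimensional input assumes 15 tokens × 100-d embeddings, per the reshape above):

```python
def lstm_param_count(units, input_dim):
    # Four gates (input, forget, cell, output), each with a kernel slice,
    # a recurrent-kernel slice, and a bias.
    return 4 * (units * (input_dim + units + 1))

print(lstm_param_count(64, 1500))  # 400640
print(lstm_param_count(16, 1500))  # 97088
```

Given that training accuracy still hits ~0.999 here, even 16 units remain enough to memorize this training set.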
Create and train model
[189]:
model_name = 'model_tweets_data_based_10000_8_v1_batch_size_250_20_latest_tweets_of_user_padded_tweets_text_only'
model = create_model_8(num_words=num_words, embedding_dim=embedding_dim,
                       embedding_matrix=embedding_matrix, max_sequence_length=max_length,
                       trainable=False,
                       tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                       optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
[190]:
model = train_model(model, model_name,
                    train_X=compact_train_tweets_text_data_padded, train_Y=train_users_data_Y,
                    val_X=compact_val_tweets_text_data_padded, val_Y=val_users_data_Y,
                    batch_size=250, epochs=400, patience=100)

accuracy
    training   (min: 0.504, max: 0.999, cur: 0.999)
    validation (min: 0.488, max: 0.521, cur: 0.503)
Loss
    training   (min: 0.002, max: 0.695, cur: 0.002)
    validation (min: 0.695, max: 3.180, cur: 3.099)
Epoch 109: val_accuracy did not improve from 0.52086
28/28 [==============================] - 2s 76ms/step - loss: 0.0020 - accuracy: 0.9990 - val_loss: 3.0988 - val_accuracy: 0.5027
Prediction and results
[191]:
prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)

Accuracy: 0.510958904109589
Precision: [0.50952381 0.51290323]
Recall: 0.43561643835616437
F1 score: 0.471111
ROC AUC: 0.510959
Model 9. (only text of tweets)
Create model
[200]:
def create_model_9(num_words, embedding_dim, embedding_matrix, max_sequence_length,
                   trainable, tweets_text_shape,
                   optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
    # Tweets text input
    text_input = Input(shape=tweets_text_shape)

    # Embedding layer for text
    embedding_layer = Embedding(num_words, embedding_dim, weights=[embedding_matrix],
                                input_length=max_sequence_length, trainable=trainable)(text_input)
    masked_input = Masking(mask_value=0.0)(embedding_layer)

    # Reshape layer to flatten only the last two dimensions
    reshape = Reshape((tweets_text_shape[0], tweets_text_shape[1] * embedding_dim))(masked_input)

    lstm1 = LSTM(64, return_sequences=True, dropout=0.8, activation="relu")(reshape)
    lstm2 = LSTM(64, activation="relu")(lstm1)

    output_layer = Dense(1, activation='sigmoid')(lstm2)

    model = keras.Model(inputs=text_input, outputs=output_layer)
    model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
    return model

batch_size=250, epochs=300
Create and train model
[201]:
model_name = 'model_tweets_data_based_10000_9_v1_batch_size_250_20_latest_tweets_of_user_padded_tweets_text_only'
model = create_model_9(num_words=num_words, embedding_dim=embedding_dim,
                       embedding_matrix=embedding_matrix, max_sequence_length=max_length,
                       trainable=False,
                       tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                       optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
[202]:
model = train_model(model, model_name,
                    train_X=compact_train_tweets_text_data_padded, train_Y=train_users_data_Y,
                    val_X=compact_val_tweets_text_data_padded, val_Y=val_users_data_Y,
                    batch_size=250, epochs=400, patience=100)

accuracy
    training   (min: 0.500, max: 0.902, cur: 0.892)
    validation (min: 0.462, max: 0.517, cur: 0.499)
Loss
    training   (min: 0.238, max: 0.714, cur: 0.253)
    validation (min: 0.693, max: 1.640, cur: 1.303)
Epoch 286: val_accuracy did not improve from 0.51750
28/28 [==============================] - 4s 142ms/step - loss: 0.2527 - accuracy: 0.8920 - val_loss: 1.3031 - val_accuracy: 0.4987
Prediction and results
[205]:
prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)

Accuracy: 0.5020547945205479
Precision: [0.50194553 0.50217707]
Recall: 0.473972602739726
F1 score: 0.487667
ROC AUC: 0.502055
Model 10. (only text of tweets)
Create model
[206]:
def create_model_10(num_words, embedding_dim, embedding_matrix, max_sequence_length,
                    trainable, tweets_text_shape,
                    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
    # Tweets text input
    text_input = Input(shape=tweets_text_shape)

    # Embedding layer for text
    embedding_layer = Embedding(num_words, embedding_dim, weights=[embedding_matrix],
                                input_length=max_sequence_length, trainable=trainable)(text_input)
    masked_input = Masking(mask_value=0.0)(embedding_layer)

    # Reshape layer to flatten only the last two dimensions
    reshape = Reshape((tweets_text_shape[0], tweets_text_shape[1] * embedding_dim))(masked_input)

    lstm1 = Bidirectional(LSTM(64, return_sequences=True, activation="relu"))(reshape)
    lstm2 = LSTM(64, activation="relu")(lstm1)

    output_layer = Dense(1, activation='sigmoid')(lstm2)

    model = keras.Model(inputs=text_input, outputs=output_layer)
    model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
    return model

batch_size=250, epochs=300
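The `Bidirectional` wrapper runs the LSTM forward and backward over the tweet axis and, with its default `merge_mode='concat'`, concatenates the two hidden states. The second LSTM therefore sees 128 features per step here instead of 64. A small sketch of that width arithmetic:

```python
def bidirectional_output_dim(units, merge_mode="concat"):
    # Keras' Bidirectional wrapper concatenates forward and backward
    # states by default; 'sum'/'mul'/'ave' keep the original width.
    return 2 * units if merge_mode == "concat" else units

print(bidirectional_output_dim(64))         # 128
print(bidirectional_output_dim(64, "sum"))  # 64
```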
Create and train model
[207]:
model_name = 'model_tweets_data_based_10000_10_v1_batch_size_250_20_latest_tweets_of_user_padded_tweets_text_only'
model = create_model_10(num_words=num_words, embedding_dim=embedding_dim,
                        embedding_matrix=embedding_matrix, max_sequence_length=max_length,
                        trainable=False,
                        tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
[208]:
model = train_model(model, model_name,
                    train_X=compact_train_tweets_text_data_padded, train_Y=train_users_data_Y,
                    val_X=compact_val_tweets_text_data_padded, val_Y=val_users_data_Y,
                    batch_size=250, epochs=400, patience=100)

accuracy
    training   (min: 0.498, max: 0.980, cur: 0.968)
    validation (min: 0.472, max: 0.538, cur: 0.485)
Loss
    training   (min: 0.045, max: 85.676, cur: 0.088)
    validation (min: 0.694, max: 149.277, cur: 3.631)
Epoch 386: val_accuracy did not improve from 0.53769
28/28 [==============================] - 5s 171ms/step - loss: 0.0877 - accuracy: 0.9685 - val_loss: 3.6306 - val_accuracy: 0.4845
[440]:
compact_train_tweets_text_data_padded.shape
[440]:
(6946, 20, 15)
Prediction and results
[209]:
prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)

Accuracy: 0.5
Precision: [0.5 0.5]
Recall: 0.6876712328767123
F1 score: 0.579008
ROC AUC: 0.500000
Model 11. (only text of tweets)
Create model
[210]:
def create_model_11(num_words, embedding_dim, embedding_matrix, max_sequence_length,
                    trainable, tweets_text_shape,
                    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
    # Tweets text input
    text_input = Input(shape=tweets_text_shape)

    # Embedding layer for text
    embedding_layer = Embedding(num_words, embedding_dim, weights=[embedding_matrix],
                                input_length=max_sequence_length, trainable=trainable)(text_input)
    masked_input = Masking(mask_value=0.0)(embedding_layer)

    # Reshape layer to flatten only the last two dimensions
    reshape = Reshape((tweets_text_shape[0], tweets_text_shape[1] * embedding_dim))(masked_input)

    lstm1 = LSTM(64, return_sequences=True, activation="relu")(reshape)
    lstm2 = Bidirectional(LSTM(64, activation="relu"))(lstm1)

    output_layer = Dense(1, activation='sigmoid')(lstm2)

    model = keras.Model(inputs=text_input, outputs=output_layer)
    model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
    return model

batch_size=250, epochs=300
Create and train model
[211]:
model_name = 'model_tweets_data_based_10000_11_v1_batch_size_250_20_latest_tweets_of_user_padded_tweets_text_only'
model = create_model_11(num_words=num_words, embedding_dim=embedding_dim,
                        embedding_matrix=embedding_matrix, max_sequence_length=max_length,
                        trainable=False,
                        tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
[212]:
model = train_model(model, model_name,
                    train_X=compact_train_tweets_text_data_padded, train_Y=train_users_data_Y,
                    val_X=compact_val_tweets_text_data_padded, val_Y=val_users_data_Y,
                    batch_size=250, epochs=400, patience=100)

accuracy
    training   (min: 0.494, max: 0.802, cur: 0.704)
    validation (min: 0.476, max: 0.527, cur: 0.500)
Loss
    training   (min: 0.472, max: 39.167, cur: 0.528)
    validation (min: 0.694, max: 56.739, cur: 1.835)
Epoch 198: val_accuracy did not improve from 0.52692
28/28 [==============================] - 3s 125ms/step - loss: 0.5281 - accuracy: 0.7039 - val_loss: 1.8355 - val_accuracy: 0.5000
Prediction and results
[213]:
prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)

Accuracy: 0.473972602739726
Precision: [0.47625 0.47121212]
Recall: 0.426027397260274
F1 score: 0.447482
ROC AUC: 0.473973
CNN
Model 12. (only text of tweets)
Create model
[359]:
def create_model_12(num_words, embedding_dim, embedding_matrix, max_sequence_length,
                    trainable, tweets_text_shape,
                    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
    # Tweets text input
    text_input = Input(shape=tweets_text_shape)

    # Embedding layer for text
    embedding_layer = Embedding(num_words, embedding_dim, weights=[embedding_matrix],
                                input_length=max_sequence_length, trainable=trainable)(text_input)
    masked_input = Masking(mask_value=0.0)(embedding_layer)

    cnn_layer1 = Conv1D(filters=16, kernel_size=3, activation='relu')(masked_input)
    flatten_layer1 = Flatten()(cnn_layer1)

    output_layer = Dense(1, activation='sigmoid')(flatten_layer1)

    model = keras.Model(inputs=text_input, outputs=output_layer)
    model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
    return model

batch_size=250, epochs=400
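With the default `'valid'` padding and stride 1, each convolved axis shrinks by `kernel_size - 1` before `Flatten` hands everything to the sigmoid head. A sketch of the standard output-length formula, using the axis lengths from this notebook's (20, 15)-shaped per-user input (which axis the kernel slides over depends on the input rank the Conv layer receives):

```python
def conv_output_length(steps, kernel_size, strides=1, padding="valid"):
    # 'valid' keeps only fully covered windows; 'same' pads to preserve length.
    if padding == "same":
        return -(-steps // strides)  # ceil(steps / strides)
    return (steps - kernel_size) // strides + 1

print(conv_output_length(15, 3))  # 13 positions along the 15-token axis
print(conv_output_length(20, 3))  # 18 positions along the 20-tweet axis
```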
Create and train model
[360]:
model_name = 'model_tweets_data_based_10000_12_v1_batch_size_250_20_latest_tweets_of_user_padded_tweets_text_only'
model = create_model_12(num_words=num_words, embedding_dim=embedding_dim,
                        embedding_matrix=embedding_matrix, max_sequence_length=max_length,
                        trainable=False,
                        tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
[361]:
model = train_model(model, model_name,
                    train_X=compact_train_tweets_text_data_padded, train_Y=train_users_data_Y,
                    val_X=compact_val_tweets_text_data_padded, val_Y=val_users_data_Y,
                    batch_size=250, epochs=400, patience=100)

accuracy
    training   (min: 0.505, max: 1.000, cur: 0.999)
    validation (min: 0.499, max: 0.528, cur: 0.523)
Loss
    training   (min: 0.005, max: 0.708, cur: 0.006)
    validation (min: 0.705, max: 5.081, cur: 5.061)
Epoch 232: val_accuracy did not improve from 0.52826
28/28 [==============================] - 2s 68ms/step - loss: 0.0064 - accuracy: 0.9993 - val_loss: 5.0612 - val_accuracy: 0.5229
Prediction and results
[362]:
prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)

Accuracy: 0.47876712328767124
Precision: [0.4786795 0.47885402]
Recall: 0.4808219178082192
F1 score: 0.479836
ROC AUC: 0.478767

Prediction and results on training set
[363]:
prediction_and_metrics(model, compact_train_tweets_text_data_padded, train_users_data_Y)

Accuracy: 0.9969766772243017
Precision: [0.99711982 0.99683362]
Recall: 0.9971206449755254
F1 score: 0.996977
ROC AUC: 0.996977
Model 13. (only text of tweets)
Create model
[395]:
def create_model_13(num_words, embedding_dim, embedding_matrix, max_sequence_length,
                    trainable, tweets_text_shape,
                    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
    # Tweets text input
    text_input = Input(shape=tweets_text_shape)

    # Embedding layer for text
    embedding_layer = Embedding(num_words, embedding_dim, weights=[embedding_matrix],
                                input_length=max_sequence_length, trainable=trainable)(text_input)
    masked_input = Masking(mask_value=0.0)(embedding_layer)

    cnn_layer1 = Conv2D(filters=16, kernel_size=3, activation='relu')(masked_input)
    flatten_layer1 = Flatten()(cnn_layer1)

    output_layer = Dense(1, activation='sigmoid')(flatten_layer1)

    model = keras.Model(inputs=text_input, outputs=output_layer)
    model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
    return model

batch_size=250, epochs=400
Create and train model
[396]:
model_name = 'model_tweets_data_based_10000_13_v1_batch_size_250_20_latest_tweets_of_user_padded_tweets_text_only'
model = create_model_13(num_words=num_words, embedding_dim=embedding_dim,
                        embedding_matrix=embedding_matrix, max_sequence_length=max_length,
                        trainable=False,
                        tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
[397]:
model = train_model(model, model_name,
                    train_X=compact_train_tweets_text_data_padded, train_Y=train_users_data_Y,
                    val_X=compact_val_tweets_text_data_padded, val_Y=val_users_data_Y,
                    batch_size=250, epochs=400, patience=100)

accuracy
    training   (min: 0.504, max: 1.000, cur: 0.999)
    validation (min: 0.497, max: 0.524, cur: 0.509)
Loss
    training   (min: 0.003, max: 0.705, cur: 0.005)
    validation (min: 0.698, max: 4.305, cur: 4.284)
Epoch 174: val_accuracy did not improve from 0.52355
28/28 [==============================] - 3s 100ms/step - loss: 0.0047 - accuracy: 0.9990 - val_loss: 4.2841 - val_accuracy: 0.5094
Prediction and results
[398]:
prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)

Accuracy: 0.5013698630136987
Precision: [0.50143266 0.50131234]
Recall: 0.5232876712328767
F1 score: 0.512064
ROC AUC: 0.501370

Prediction and results on training set
[399]:
prediction_and_metrics(model, compact_train_tweets_text_data_padded, train_users_data_Y)

Accuracy: 0.9975525482291967
Precision: [0.99712313 0.99798271]
Recall: 0.9971206449755254
F1 score: 0.997551
ROC AUC: 0.997553
Model 14. (only text of tweets)
Create model
[428]:
def create_model_14(num_words, embedding_dim, embedding_matrix, max_sequence_length,
                    trainable, tweets_text_shape,
                    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
    # Tweets text input
    text_input = Input(shape=tweets_text_shape)

    # Embedding layer for text
    embedding_layer = Embedding(num_words, embedding_dim, weights=[embedding_matrix],
                                input_length=max_sequence_length, trainable=trainable)(text_input)
    masked_input = Masking(mask_value=0.0)(embedding_layer)

    cnn_layer1 = Conv2D(filters=16, kernel_size=(3, 3), activation='relu')(masked_input)
    flatten_layer1 = Flatten()(cnn_layer1)
    dropout_layer1 = Dropout(0.2)(flatten_layer1)

    output_layer = Dense(1, activation='sigmoid')(dropout_layer1)

    model = keras.Model(inputs=text_input, outputs=output_layer)
    model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
    return model

batch_size=250, epochs=400
Create and train model
[429]:
model_name = 'model_tweets_data_based_10000_14_v1_batch_size_250_20_latest_tweets_of_user_padded_tweets_text_only'
model = create_model_14(num_words=num_words, embedding_dim=embedding_dim,
                        embedding_matrix=embedding_matrix, max_sequence_length=max_length,
                        trainable=False,
                        tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
[430]:
model = train_model(model, model_name,
                    train_X=compact_train_tweets_text_data_padded, train_Y=train_users_data_Y,
                    val_X=compact_val_tweets_text_data_padded, val_Y=val_users_data_Y,
                    batch_size=250, epochs=400, patience=100)

accuracy
    training   (min: 0.500, max: 0.993, cur: 0.993)
    validation (min: 0.478, max: 0.536, cur: 0.510)
Loss
    training   (min: 0.026, max: 0.715, cur: 0.026)
    validation (min: 0.700, max: 2.998, cur: 2.873)
Epoch 330: val_accuracy did not improve from 0.53634
28/28 [==============================] - 3s 112ms/step - loss: 0.0257 - accuracy: 0.9929 - val_loss: 2.8729 - val_accuracy: 0.5101
Prediction and results
[431]:
prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)

Accuracy: 0.4876712328767123
Precision: [0.48728814 0.48803191]
Recall: 0.5027397260273972
F1 score: 0.495277
ROC AUC: 0.487671

Prediction and results on training set
[432]:
prediction_and_metrics(model, compact_train_tweets_text_data_padded, train_users_data_Y)

Accuracy: 0.9971206449755254
Precision: [0.99683453 0.99740709]
Recall: 0.996832709473078
F1 score: 0.997120
ROC AUC: 0.997121
Model 15. (only text of tweets)
Create model
[433]:
def create_model_15(num_words, embedding_dim, embedding_matrix, max_sequence_length,
                    trainable, tweets_text_shape,
                    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
    # Tweets text input
    text_input = Input(shape=tweets_text_shape)

    # Embedding layer for text
    embedding_layer = Embedding(num_words, embedding_dim, weights=[embedding_matrix],
                                input_length=max_sequence_length, trainable=trainable)(text_input)
    masked_input = Masking(mask_value=0.0)(embedding_layer)

    cnn_layer1 = Conv1D(filters=16, kernel_size=3, activation='relu')(masked_input)
    # Note: dropout_layer1 is never connected downstream — Flatten below takes
    # cnn_layer1 directly, so this dropout has no effect on the trained model.
    dropout_layer1 = Dropout(0.5)(cnn_layer1)
    flatten_layer1 = Flatten()(cnn_layer1)
    dropout_layer2 = Dropout(0.2)(flatten_layer1)

    output_layer = Dense(1, activation='sigmoid')(dropout_layer2)

    model = keras.Model(inputs=text_input, outputs=output_layer)
    model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
    return model

batch_size=250, epochs=400
Create and train model
[434]:
model_name = 'model_tweets_data_based_10000_15_v1_batch_size_250_20_latest_tweets_of_user_padded_tweets_text_only'
model = create_model_15(num_words=num_words, embedding_dim=embedding_dim,
                        embedding_matrix=embedding_matrix, max_sequence_length=max_length,
                        trainable=False,
                        tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
[435]:
model = train_model(model, model_name,
                    train_X=compact_train_tweets_text_data_padded, train_Y=train_users_data_Y,
                    val_X=compact_val_tweets_text_data_padded, val_Y=val_users_data_Y,
                    batch_size=250, epochs=400, patience=100)

accuracy
    training   (min: 0.503, max: 0.929, cur: 0.929)
    validation (min: 0.495, max: 0.517, cur: 0.508)
Loss
    training   (min: 0.174, max: 0.724, cur: 0.175)
    validation (min: 0.700, max: 1.543, cur: 1.539)
Epoch 103: val_accuracy did not improve from 0.51750
28/28 [==============================] - 2s 67ms/step - loss: 0.1748 - accuracy: 0.9289 - val_loss: 1.5394 - val_accuracy: 0.5081
Prediction and results
[436]:
prediction_and_metrics(model, compact_test_tweets_text_data_padded, test_users_data_Y)

Accuracy: 0.5260273972602739
Precision: [0.52661064 0.52546917]
Recall: 0.536986301369863
F1 score: 0.531165
ROC AUC: 0.526027

Prediction and results on training set
[437]:
prediction_and_metrics(model, compact_train_tweets_text_data_padded, train_users_data_Y)

Accuracy: 0.6474229772530953
Precision: [0.64528944 0.64962011]
Recall: 0.6400806219406853
F1 score: 0.644815
ROC AUC: 0.647423
Tweets text and additional tweets data
Model 1. (tweets text and additional tweets features)
Create model
[447]:
def create_model_1(num_words, embedding_dim, embedding_matrix, max_sequence_length,
                   trainable, tweets_text_shape, add_tweets_feat_shape,
                   optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
    # Additional tweet features input
    additional_tweet_input = Input(shape=add_tweets_feat_shape)
    masked_input_add_tweets_feat = Masking(mask_value=0.0)(additional_tweet_input)
    lstm_layer_2 = LSTM(64)(masked_input_add_tweets_feat)

    # ---------------------------------------------------------------------
    # Tweets text input
    text_input = Input(shape=tweets_text_shape)

    # Embedding layer for text
    embedding_layer = Embedding(num_words, embedding_dim, weights=[embedding_matrix],
                                input_length=max_sequence_length, trainable=trainable)(text_input)
    masked_input_text = Masking(mask_value=0.0)(embedding_layer)
    reshaped = Reshape((tweets_text_shape[0], tweets_text_shape[1] * embedding_dim))(masked_input_text)

    cnn_layer1 = Conv1D(filters=16, kernel_size=3, activation='relu')(reshaped)
    cnn_flatten_layer1 = Flatten()(cnn_layer1)

    # ---------------------------------------------------------------------
    # Concatenate text and additional features
    concatenated = concatenate([lstm_layer_2, cnn_flatten_layer1])

    output_layer = Dense(1, activation='sigmoid')(concatenated)

    model = keras.Model(inputs=[text_input, additional_tweet_input], outputs=output_layer)
    model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
    return model

batch_size=250, epochs=400
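The merged vector fed to the sigmoid head is the 64-dimensional LSTM summary of the additional tweet features concatenated with the flattened Conv1D map over the 20-tweet axis (kernel 3, 'valid' padding leaves 18 positions × 16 filters). A quick sketch of the resulting width (20 tweets per user assumed, matching the (6946, 20, 15) shape printed earlier):

```python
def concatenated_dim(lstm_units, conv_steps, kernel_size, filters):
    # 'valid' padding, stride 1: each convolved axis loses kernel_size - 1 steps.
    conv_positions = conv_steps - kernel_size + 1
    return lstm_units + conv_positions * filters

print(concatenated_dim(64, 20, 3, 16))  # 64 + 18*16 = 352
```

The CNN branch thus contributes 288 of the 352 concatenated features, so the text branch dominates the final linear layer unless the dense weights compensate.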
#### Create and train model
[448]:
model_name = 'model_tweets_data_based_10000_1_v1_batch_size_250_20_latest_tweets_of_user_padded_tweets_text_and_additional_tweets_features'
model = create_model_1(num_words=num_words, embedding_dim=embedding_dim, embedding_matrix=embedding_matrix,
                       max_sequence_length=max_length, trainable=False,
                       tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                       add_tweets_feat_shape=np.array(compact_train_tweets_add_feat_data_padded[0]).shape,
                       optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
[449]:
model = train_model(model, model_name,
                    train_X=[compact_train_tweets_text_data_padded, compact_train_tweets_add_feat_data_padded],
                    train_Y=train_users_data_Y,
                    val_X=[compact_val_tweets_text_data_padded, compact_val_tweets_add_feat_data_padded],
                    val_Y=val_users_data_Y,
                    batch_size=250, epochs=400, patience=100)

accuracy
    training   (min: 0.500, max: 1.000, cur: 1.000)
    validation (min: 0.506, max: 0.540, cur: 0.530)
Loss
    training   (min: 0.001, max: 0.704, cur: 0.001)
    validation (min: 0.693, max: 4.179, cur: 4.179)
Epoch 184: val_accuracy did not improve from 0.54038
28/28 [==============================] - 2s 88ms/step - loss: 6.1684e-04 - accuracy: 0.9997 - val_loss: 4.1787 - val_accuracy: 0.5303
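`train_model` is defined earlier in the notebook; judging by its log it monitors `val_accuracy` with a patience window (`patience=100` here, which is why training runs on to epoch 184 even though the best validation accuracy of 0.54038 was reached much earlier). The patience rule it presumably applies can be sketched framework-free (function name illustrative):

```python
def best_epoch_with_patience(val_accuracies, patience):
    """Return (best_epoch, stop_epoch): stop once `patience` epochs pass without improvement."""
    best_epoch, best_acc = 0, float('-inf')
    for epoch, acc in enumerate(val_accuracies):
        if acc > best_acc:
            best_epoch, best_acc = epoch, acc
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch  # patience exhausted
    return best_epoch, len(val_accuracies) - 1  # ran out of epochs first

# Best value at epoch 1; three epochs without improvement stop the run at epoch 4
best, stop = best_epoch_with_patience([0.50, 0.54, 0.53, 0.52, 0.51, 0.52], patience=3)
```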
Training set reaches accuracy > 0.9 within 4 epochs.
#### Prediction and results
[450]:
prediction_and_metrics(model, [compact_test_tweets_text_data_padded, compact_test_tweets_add_feat_data_padded], test_users_data_Y)

Accuracy: 0.4897260273972603
Precision: [0.48951049 0.48993289]
Recall: 0.5
F1 score: 0.494915
ROC AUC: 0.489726
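`prediction_and_metrics` is defined earlier in the notebook; the two-element Precision line suggests it reports precision per class (as with scikit-learn's `average=None`). The headline numbers it prints can be reproduced with a minimal pure-numpy sketch for the positive class (function name illustrative):

```python
import numpy as np

def binary_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy, precision, recall, F1 for the positive class at a fixed threshold."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    accuracy = np.mean(y_pred == y_true)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = binary_metrics([1, 0, 1, 1], [0.9, 0.4, 0.2, 0.8])
```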
#### Prediction and results on training set
[452]:
prediction_and_metrics(model, [compact_train_tweets_text_data_padded, compact_train_tweets_add_feat_data_padded], train_users_data_Y)

Accuracy: 0.9995680967463288
Precision: [0.99971198 0.99942429]
Recall: 0.9997120644975526
F1 score: 0.999568
ROC AUC: 0.999568
## Model 2. (tweets text and additional tweets features)
#### Create model
[453]:
def create_model_2(num_words, embedding_dim, embedding_matrix, max_sequence_length, trainable,
                   tweets_text_shape, add_tweets_feat_shape,
                   optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
    # Additional tweet features input
    additional_tweet_input = Input(shape=add_tweets_feat_shape)
    masked_input_add_tweets_feat = Masking(mask_value=0.0)(additional_tweet_input)
    # lstm_layer_1 = LSTM(64, return_sequences=True)(masked_input_add_tweets_feat)
    lstm_layer_2 = LSTM(64)(masked_input_add_tweets_feat)
    lstm_dropout_layer1 = Dropout(0.25)(lstm_layer_2)

    # ---------------------------------------------------------------------
    # Tweets text input
    text_input = Input(shape=tweets_text_shape)
    # Embedding layer for text
    embedding_layer = Embedding(num_words, embedding_dim, weights=[embedding_matrix],
                                input_length=max_sequence_length, trainable=trainable)(text_input)
    masked_input_text = Masking(mask_value=0.0)(embedding_layer)
    reshaped = Reshape((tweets_text_shape[0], tweets_text_shape[1]*embedding_dim))(masked_input_text)
    cnn_layer1 = Conv1D(filters=16, kernel_size=3, activation='relu')(reshaped)
    cnn_dropout_layer1 = Dropout(0.25)(cnn_layer1)
    # pooling_layer1 = MaxPooling1D(pool_size=2, strides=1, padding='valid')(cnn_layer1)
    cnn_flatten_layer1 = Flatten()(cnn_layer1)  # NOTE: flattens cnn_layer1, so cnn_dropout_layer1 is never connected
    cnn_dropout_layer2 = Dropout(0.25)(cnn_flatten_layer1)

    # ---------------------------------------------------------------------
    # Concatenate text and additional features
    concatenated = concatenate([lstm_dropout_layer1, cnn_dropout_layer2])
    # dense_layer1 = Dense(128)(concatenated)
    # activation_layer1 = Activation('relu')(dense_layer1)
    # dropout_layer1 = Dropout(0.2)(flatten_layer1)
    output_layer = Dense(1, activation='sigmoid')(concatenated)

    model = keras.Model(inputs=[text_input, additional_tweet_input], outputs=output_layer)
    model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
    return model

### batch_size=250, epochs=400
#### Create and train model
[454]:
model_name = 'model_tweets_data_based_10000_2_v1_batch_size_250_20_latest_tweets_of_user_padded_tweets_text_and_additional_tweets_features'
model = create_model_2(num_words=num_words, embedding_dim=embedding_dim, embedding_matrix=embedding_matrix,
                       max_sequence_length=max_length, trainable=False,
                       tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                       add_tweets_feat_shape=np.array(compact_train_tweets_add_feat_data_padded[0]).shape,
                       optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
[455]:
model = train_model(model, model_name,
                    train_X=[compact_train_tweets_text_data_padded, compact_train_tweets_add_feat_data_padded],
                    train_Y=train_users_data_Y,
                    val_X=[compact_val_tweets_text_data_padded, compact_val_tweets_add_feat_data_padded],
                    val_Y=val_users_data_Y,
                    batch_size=250, epochs=400, patience=100)

accuracy
    training   (min: 0.504, max: 0.998, cur: 0.997)
    validation (min: 0.476, max: 0.527, cur: 0.512)
Loss
    training   (min: 0.009, max: 0.700, cur: 0.010)
    validation (min: 0.695, max: 3.100, cur: 2.997)
Epoch 178: val_accuracy did not improve from 0.52692
28/28 [==============================] - 3s 95ms/step - loss: 0.0104 - accuracy: 0.9968 - val_loss: 2.9973 - val_accuracy: 0.5121
#### Prediction and results
[456]:
prediction_and_metrics(model, [compact_test_tweets_text_data_padded, compact_test_tweets_add_feat_data_padded], test_users_data_Y)

Accuracy: 0.5020547945205479
Precision: [0.50223547 0.50190114]
Recall: 0.5424657534246575
F1 score: 0.521396
ROC AUC: 0.502055
#### Prediction and results on training set
[457]:
prediction_and_metrics(model, [compact_train_tweets_text_data_padded, compact_train_tweets_add_feat_data_padded], train_users_data_Y)

Accuracy: 0.9995680967463288
Precision: [1.         0.99913694]
Recall: 1.0
F1 score: 0.999568
ROC AUC: 0.999568
## Model 3. (tweets text and additional tweets features)
#### Create model
[176]:
def create_model_3(num_words, embedding_dim, embedding_matrix, max_sequence_length, trainable,
                   tweets_text_shape, add_tweets_feat_shape,
                   optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
    # Additional tweet features input
    additional_tweet_input = Input(shape=add_tweets_feat_shape)
    masked_input_add_tweets_feat = Masking(mask_value=0.0)(additional_tweet_input)
    # lstm_layer_1 = LSTM(64, return_sequences=True)(masked_input_add_tweets_feat)
    lstm_layer_2 = LSTM(64)(masked_input_add_tweets_feat)
    lstm_dropout_layer1 = Dropout(0.5)(lstm_layer_2)

    # ---------------------------------------------------------------------
    # Tweets text input
    text_input = Input(shape=tweets_text_shape)
    # Embedding layer for text
    embedding_layer = Embedding(num_words, embedding_dim, weights=[embedding_matrix],
                                input_length=max_sequence_length, trainable=trainable)(text_input)
    masked_input_text = Masking(mask_value=0.0)(embedding_layer)
    reshaped = Reshape((tweets_text_shape[0], tweets_text_shape[1]*embedding_dim))(masked_input_text)
    cnn_layer1 = Conv1D(filters=16, kernel_size=3, activation='relu')(reshaped)
    cnn_dropout_layer1 = Dropout(0.5)(cnn_layer1)
    # pooling_layer1 = MaxPooling1D(pool_size=2, strides=1, padding='valid')(cnn_layer1)
    cnn_flatten_layer1 = Flatten()(cnn_layer1)  # NOTE: flattens cnn_layer1, so cnn_dropout_layer1 is never connected
    cnn_dropout_layer2 = Dropout(0.25)(cnn_flatten_layer1)

    # ---------------------------------------------------------------------
    # Concatenate text and additional features
    concatenated = concatenate([lstm_dropout_layer1, cnn_dropout_layer2])
    # dense_layer1 = Dense(128)(concatenated)
    # activation_layer1 = Activation('relu')(dense_layer1)
    # dropout_layer1 = Dropout(0.2)(flatten_layer1)
    output_layer = Dense(1, activation='sigmoid')(concatenated)

    model = keras.Model(inputs=[text_input, additional_tweet_input], outputs=output_layer)
    model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
    return model

### batch_size=250, epochs=400
#### Create and train model
[177]:
model_name = 'model_tweets_data_based_10000_3_v1_batch_size_250_20_latest_tweets_of_user_padded_tweets_text_and_additional_tweets_features'
model = create_model_3(num_words=num_words, embedding_dim=embedding_dim, embedding_matrix=embedding_matrix,
                       max_sequence_length=max_length, trainable=False,
                       tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                       add_tweets_feat_shape=np.array(compact_train_tweets_add_feat_data_padded[0]).shape,
                       optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
[178]:
model = train_model(model, model_name,
                    train_X=[compact_train_tweets_text_data_padded, compact_train_tweets_add_feat_data_padded],
                    train_Y=train_users_data_Y,
                    val_X=[compact_val_tweets_text_data_padded, compact_val_tweets_add_feat_data_padded],
                    val_Y=val_users_data_Y,
                    batch_size=250, epochs=400, patience=100)

accuracy
    training   (min: 0.503, max: 0.998, cur: 0.995)
    validation (min: 0.493, max: 0.525, cur: 0.509)
Loss
    training   (min: 0.012, max: 0.701, cur: 0.015)
    validation (min: 0.694, max: 2.495, cur: 2.408)
Epoch 107: val_accuracy did not improve from 0.52490
28/28 [==============================] - 2s 84ms/step - loss: 0.0153 - accuracy: 0.9954 - val_loss: 2.4077 - val_accuracy: 0.5087
#### Prediction and results
[179]:
prediction_and_metrics(model, [compact_test_tweets_text_data_padded, compact_test_tweets_add_feat_data_padded], test_users_data_Y)

Accuracy: 0.5061643835616438
Precision: [0.50704225 0.50548112]
Recall: 0.5684931506849316
F1 score: 0.535139
ROC AUC: 0.506164
#### Prediction and results on training set
[180]:
prediction_and_metrics(model, [compact_train_tweets_text_data_padded, compact_train_tweets_add_feat_data_padded], train_users_data_Y)

Accuracy: 0.9412611575007198
Precision: [0.93035664 0.95273264]
Recall: 0.9285919953930319
F1 score: 0.940507
ROC AUC: 0.941261
## Model 4. (tweets text and additional tweets features)
#### Create model
[181]:
def create_model_4(num_words, embedding_dim, embedding_matrix, max_sequence_length, trainable,
                   tweets_text_shape, add_tweets_feat_shape,
                   optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
    # Additional tweet features input
    additional_tweet_input = Input(shape=add_tweets_feat_shape)
    masked_input_add_tweets_feat = Masking(mask_value=0.0)(additional_tweet_input)
    # lstm_layer_1 = LSTM(64, return_sequences=True)(masked_input_add_tweets_feat)
    lstm_layer_2 = LSTM(64)(masked_input_add_tweets_feat)
    lstm_dropout_layer1 = Dropout(0.25)(lstm_layer_2)

    # ---------------------------------------------------------------------
    # Tweets text input
    text_input = Input(shape=tweets_text_shape)
    # Embedding layer for text
    embedding_layer = Embedding(num_words, embedding_dim, weights=[embedding_matrix],
                                input_length=max_sequence_length, trainable=trainable)(text_input)
    masked_input_text = Masking(mask_value=0.0)(embedding_layer)
    reshaped = Reshape((tweets_text_shape[0], tweets_text_shape[1]*embedding_dim))(masked_input_text)
    cnn_layer1 = Conv1D(filters=16, kernel_size=3, activation='relu')(reshaped)
    cnn_dropout_layer1 = Dropout(0.25)(cnn_layer1)
    # pooling_layer1 = MaxPooling1D(pool_size=2, strides=1, padding='valid')(cnn_layer1)
    cnn_flatten_layer1 = Flatten()(cnn_layer1)  # NOTE: flattens cnn_layer1, so cnn_dropout_layer1 is never connected
    cnn_dropout_layer2 = Dropout(0.25)(cnn_flatten_layer1)

    # ---------------------------------------------------------------------
    # Concatenate text and additional features
    concatenated = concatenate([lstm_dropout_layer1, cnn_dropout_layer2])
    # dense_layer1 = Dense(128)(concatenated)
    # activation_layer1 = Activation('relu')(dense_layer1)
    # dropout_layer1 = Dropout(0.2)(flatten_layer1)
    output_layer = Dense(1, activation='sigmoid')(concatenated)

    model = keras.Model(inputs=[text_input, additional_tweet_input], outputs=output_layer)
    model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
    return model

### batch_size=250, epochs=400
#### Create and train model
[182]:
model_name = 'model_tweets_data_based_10000_4_v1_batch_size_250_20_latest_tweets_of_user_padded_tweets_text_and_additional_tweets_features'
model = create_model_4(num_words=num_words, embedding_dim=embedding_dim, embedding_matrix=embedding_matrix,
                       max_sequence_length=max_length, trainable=False,
                       tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                       add_tweets_feat_shape=np.array(compact_train_tweets_add_feat_data_padded[0]).shape,
                       optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
[183]:
model = train_model(model, model_name,
                    train_X=[compact_train_tweets_text_data_padded, compact_train_tweets_add_feat_data_padded],
                    train_Y=train_users_data_Y,
                    val_X=[compact_val_tweets_text_data_padded, compact_val_tweets_add_feat_data_padded],
                    val_Y=val_users_data_Y,
                    batch_size=250, epochs=400, patience=100)

accuracy
    training   (min: 0.504, max: 0.998, cur: 0.997)
    validation (min: 0.487, max: 0.536, cur: 0.516)
Loss
    training   (min: 0.007, max: 0.702, cur: 0.010)
    validation (min: 0.693, max: 3.126, cur: 3.084)
Epoch 198: val_accuracy did not improve from 0.53567
28/28 [==============================] - 2s 90ms/step - loss: 0.0104 - accuracy: 0.9967 - val_loss: 3.0836 - val_accuracy: 0.5162
#### Prediction and results
[184]:
prediction_and_metrics(model, [compact_test_tweets_text_data_padded, compact_test_tweets_add_feat_data_padded], test_users_data_Y)

Accuracy: 0.4780821917808219
Precision: [0.4761194  0.47974684]
Recall: 0.5191780821917809
F1 score: 0.498684
ROC AUC: 0.478082
#### Prediction and results on training set
[185]:
prediction_and_metrics(model, [compact_train_tweets_text_data_padded, compact_train_tweets_add_feat_data_padded], train_users_data_Y)

Accuracy: 0.9994241289951051
Precision: [0.9997119  0.99913669]
Recall: 0.9997120644975526
F1 score: 0.999424
ROC AUC: 0.999424
## Model 5. (tweets text and additional tweets features)
#### Create model
[189]:
def create_model_5(num_words, embedding_dim, embedding_matrix, max_sequence_length, trainable,
                   tweets_text_shape, add_tweets_feat_shape,
                   optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
    # Additional tweet features input
    additional_tweet_input = Input(shape=add_tweets_feat_shape)
    masked_input_add_tweets_feat = Masking(mask_value=0.0)(additional_tweet_input)
    # lstm_layer_1 = LSTM(64, return_sequences=True)(masked_input_add_tweets_feat)
    lstm_layer_2 = LSTM(64)(masked_input_add_tweets_feat)
    lstm_dropout_layer1 = Dropout(0.5)(lstm_layer_2)

    # ---------------------------------------------------------------------
    # Tweets text input
    text_input = Input(shape=tweets_text_shape)
    # Embedding layer for text
    embedding_layer = Embedding(num_words, embedding_dim, weights=[embedding_matrix],
                                input_length=max_sequence_length, trainable=trainable)(text_input)
    masked_input_text = Masking(mask_value=0.0)(embedding_layer)
    reshaped = Reshape((tweets_text_shape[0], tweets_text_shape[1]*embedding_dim))(masked_input_text)
    cnn_layer1 = Conv1D(filters=16, kernel_size=3, activation='relu')(reshaped)
    cnn_dropout_layer1 = Dropout(0.5)(cnn_layer1)
    # pooling_layer1 = MaxPooling1D(pool_size=2, strides=1, padding='valid')(cnn_layer1)
    cnn_flatten_layer1 = Flatten()(cnn_layer1)  # NOTE: flattens cnn_layer1, so cnn_dropout_layer1 is never connected
    cnn_dropout_layer2 = Dropout(0.25)(cnn_flatten_layer1)

    # ---------------------------------------------------------------------
    # Concatenate text and additional features, then a small dense head
    concatenated = concatenate([lstm_dropout_layer1, cnn_dropout_layer2])
    concatenated_dense_layer1 = Dense(16)(concatenated)
    concatenated_activation_layer1 = Activation('relu')(concatenated_dense_layer1)
    concatenated_dropout_layer1 = Dropout(0.2)(concatenated_activation_layer1)
    output_layer = Dense(1, activation='sigmoid')(concatenated_dropout_layer1)

    model = keras.Model(inputs=[text_input, additional_tweet_input], outputs=output_layer)
    model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
    return model

### batch_size=250, epochs=400
#### Create and train model
[190]:
model_name = 'model_tweets_data_based_10000_5_v1_batch_size_250_20_latest_tweets_of_user_padded_tweets_text_and_additional_tweets_features'
model = create_model_5(num_words=num_words, embedding_dim=embedding_dim, embedding_matrix=embedding_matrix,
                       max_sequence_length=max_length, trainable=False,
                       tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                       add_tweets_feat_shape=np.array(compact_train_tweets_add_feat_data_padded[0]).shape,
                       optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
[191]:
model = train_model(model, model_name,
                    train_X=[compact_train_tweets_text_data_padded, compact_train_tweets_add_feat_data_padded],
                    train_Y=train_users_data_Y,
                    val_X=[compact_val_tweets_text_data_padded, compact_val_tweets_add_feat_data_padded],
                    val_Y=val_users_data_Y,
                    batch_size=250, epochs=400, patience=100)

accuracy
    training   (min: 0.496, max: 0.996, cur: 0.995)
    validation (min: 0.483, max: 0.530, cur: 0.521)
Loss
    training   (min: 0.012, max: 0.701, cur: 0.017)
    validation (min: 0.694, max: 3.489, cur: 3.364)
Epoch 158: val_accuracy did not improve from 0.53028
28/28 [==============================] - 2s 90ms/step - loss: 0.0168 - accuracy: 0.9947 - val_loss: 3.3637 - val_accuracy: 0.5209
#### Prediction and results
[192]:
prediction_and_metrics(model, [compact_test_tweets_text_data_padded, compact_test_tweets_add_feat_data_padded], test_users_data_Y)

Accuracy: 0.5020547945205479
Precision: [0.50202977 0.50208044]
Recall: 0.4958904109589041
F1 score: 0.498966
ROC AUC: 0.502055
#### Prediction and results on training set
[193]:
prediction_and_metrics(model, [compact_train_tweets_text_data_padded, compact_train_tweets_add_feat_data_padded], train_users_data_Y)

Accuracy: 0.9988482579902102
Precision: [0.99942346 0.99827437]
Recall: 0.9994241289951051
F1 score: 0.998849
ROC AUC: 0.998848
# Tweets text and additional tweets data and user data
[228]:
train_users_data = tf.convert_to_tensor(train_users_data, dtype=tf.float32)
val_users_data = tf.convert_to_tensor(val_users_data, dtype=tf.float32)
test_users_data = tf.convert_to_tensor(test_users_data, dtype=tf.float32)
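The notebook's title mentions standardization, and the user features would normally be z-scored with train-set statistics before being converted to tensors (the astronomically large losses in the next training run suggest that step may have been skipped for the inputs passed there). A minimal numpy sketch of that transform, with an illustrative function name:

```python
import numpy as np

def standardize(train, *others):
    """Z-score features using train-set mean/std; apply the same transform to val/test."""
    train = np.asarray(train, dtype='float64')
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    return tuple((np.asarray(x, dtype='float64') - mean) / std for x in (train, *others))

train = np.array([[1.0, 10.0], [3.0, 30.0]])
test = np.array([[2.0, 20.0]])
train_s, test_s = standardize(train, test)
```

Only the train split defines `mean` and `std`; the same transform is applied to validation and test splits to avoid leaking their statistics into training.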
## Model 1. (tweets text and additional tweets features and user data)
#### Create model
[233]:
def create_model_1(num_words, embedding_dim, embedding_matrix, max_sequence_length, trainable,
                   tweets_text_shape, add_tweets_feat_shape, user_data_shape,
                   optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
    # User data input
    user_data_input = Input(shape=user_data_shape)

    # ---------------------------------------------------------------------
    # Additional tweet features input
    additional_tweet_input = Input(shape=add_tweets_feat_shape)
    masked_input_add_tweets_feat = Masking(mask_value=0.0)(additional_tweet_input)
    # lstm_layer_1 = LSTM(64, return_sequences=True)(masked_input_add_tweets_feat)
    lstm_layer_2 = LSTM(16)(masked_input_add_tweets_feat)
    lstm_dropout_layer1 = Dropout(0.5)(lstm_layer_2)

    # ---------------------------------------------------------------------
    # Tweets text input
    text_input = Input(shape=tweets_text_shape)
    # Embedding layer for text
    embedding_layer = Embedding(num_words, embedding_dim, weights=[embedding_matrix],
                                input_length=max_sequence_length, trainable=trainable)(text_input)
    masked_input_text = Masking(mask_value=0.0)(embedding_layer)
    reshaped = Reshape((tweets_text_shape[0], tweets_text_shape[1]*embedding_dim))(masked_input_text)
    cnn_layer1 = Conv1D(filters=16, kernel_size=3, activation='relu')(reshaped)
    cnn_dropout_layer1 = Dropout(0.5)(cnn_layer1)
    # pooling_layer1 = MaxPooling1D(pool_size=2, strides=1, padding='valid')(cnn_layer1)
    cnn_flatten_layer1 = Flatten()(cnn_layer1)  # NOTE: flattens cnn_layer1, so cnn_dropout_layer1 is never connected
    cnn_dropout_layer2 = Dropout(0.5)(cnn_flatten_layer1)

    # ---------------------------------------------------------------------
    # Concatenate text, additional features, and user data
    concatenated = concatenate([lstm_dropout_layer1, cnn_dropout_layer2, user_data_input])
    dense_layer1 = Dense(16)(concatenated)
    activation_layer1 = Activation('relu')(dense_layer1)
    dropout_layer1 = Dropout(0.2)(activation_layer1)
    output_layer = Dense(1, activation='sigmoid')(concatenated)  # NOTE: reads from `concatenated`, so the dense head above is never connected

    model = keras.Model(inputs=[text_input, additional_tweet_input, user_data_input], outputs=output_layer)
    model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
    return model
### batch_size=250, epochs=400
#### Create and train model
[234]:
model_name = 'model_tweets_data_based_10000_1_v1_batch_size_250_20_latest_tweets_of_user_padded_tweets_text_and_additional_tweets_features_and_user_data'
model = create_model_1(num_words=num_words, embedding_dim=embedding_dim, embedding_matrix=embedding_matrix,
                       max_sequence_length=max_length, trainable=False,
                       tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                       add_tweets_feat_shape=np.array(compact_train_tweets_add_feat_data_padded[0]).shape,
                       user_data_shape=train_users_data.shape[1],
                       optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
[235]:
# train_users_data = tf.convert_to_tensor(train_users_data, dtype=tf.float32)
# val_users_data = tf.convert_to_tensor(val_users_data, dtype=tf.float32)
# test_users_data = tf.convert_to_tensor(test_users_data, dtype=tf.float32)
[236]:
model = train_model(model, model_name,
                    train_X=[compact_train_tweets_text_data_padded, compact_train_tweets_add_feat_data_padded, p1],
                    train_Y=train_users_data_Y,
                    val_X=[compact_val_tweets_text_data_padded, compact_val_tweets_add_feat_data_padded, p2],
                    val_Y=val_users_data_Y,
                    batch_size=250, epochs=400, patience=100)

accuracy
    training   (min: 0.492, max: 0.551, cur: 0.521)
    validation (min: 0.498, max: 0.613, cur: 0.536)
Loss
    training   (min: 12280035016704.000, max: 16165169201676288.000, cur: 51037809410048.000)
    validation (min: 24513566720.000, max: 11385774691844096.000, cur: 67552852049920.000)
Epoch 175: val_accuracy did not improve from 0.61306
28/28 [==============================] - 2s 82ms/step - loss: 51037809410048.0000 - accuracy: 0.5209 - val_loss: 67552852049920.0000 - val_accuracy: 0.5363
#### Prediction and results
[237]:
prediction_and_metrics(model, [compact_test_tweets_text_data_padded, compact_test_tweets_add_feat_data_padded, test_users_data], test_users_data_Y)

Accuracy: 0.5767123287671233
Precision: [0.57179487 0.58235294]
Recall: 0.5424657534246575
F1 score: 0.561702
ROC AUC: 0.576712
#### Prediction and results on training set
[238]:
prediction_and_metrics(model, [compact_train_tweets_text_data_padded, compact_train_tweets_add_feat_data_padded, train_users_data], train_users_data_Y)

Accuracy: 0.6012093291102792
Precision: [0.5987637  0.60377916]
Recall: 0.5888281025050389
F1 score: 0.596210
ROC AUC: 0.601209
## Model 2. (tweets text and additional tweets features and user data)
#### Create model
[239]:
def create_model_2(num_words, embedding_dim, embedding_matrix, max_sequence_length, trainable,
                   tweets_text_shape, add_tweets_feat_shape, user_data_shape,
                   optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)):
    # User data input
    user_data_input = Input(shape=user_data_shape)

    # ---------------------------------------------------------------------
    # Additional tweet features input
    additional_tweet_input = Input(shape=add_tweets_feat_shape)
    masked_input_add_tweets_feat = Masking(mask_value=0.0)(additional_tweet_input)
    lstm_layer_2 = LSTM(16)(masked_input_add_tweets_feat)
    lstm_dropout_layer1 = Dropout(0.5)(lstm_layer_2)

    # ---------------------------------------------------------------------
    # Tweets text input
    text_input = Input(shape=tweets_text_shape)
    # Embedding layer for text
    embedding_layer = Embedding(num_words, embedding_dim, weights=[embedding_matrix],
                                input_length=max_sequence_length, trainable=trainable)(text_input)
    masked_input_text = Masking(mask_value=0.0)(embedding_layer)
    reshaped = Reshape((tweets_text_shape[0], tweets_text_shape[1]*embedding_dim))(masked_input_text)
    cnn_layer1 = Conv1D(filters=16, kernel_size=3, activation='relu')(reshaped)
    cnn_dropout_layer1 = Dropout(0.5)(cnn_layer1)
    # pooling_layer1 = MaxPooling1D(pool_size=2, strides=1, padding='valid')(cnn_layer1)
    cnn_flatten_layer1 = Flatten()(cnn_layer1)  # NOTE: flattens cnn_layer1, so cnn_dropout_layer1 is never connected
    cnn_dropout_layer2 = Dropout(0.5)(cnn_flatten_layer1)

    # ---------------------------------------------------------------------
    # Concatenate text, additional features, and user data
    concatenated = concatenate([lstm_dropout_layer1, cnn_dropout_layer2, user_data_input])
    # dense_layer1 = Dense(16)(concatenated)
    # activation_layer1 = Activation('relu')(dense_layer1)
    dropout_layer1 = Dropout(0.3)(concatenated)
    output_layer = Dense(1, activation='sigmoid')(concatenated)  # NOTE: reads from `concatenated`, so dropout_layer1 is never connected

    model = keras.Model(inputs=[text_input, additional_tweet_input, user_data_input], outputs=output_layer)
    model.compile(optimizer=optimizer, loss=keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
    return model

### batch_size=250, epochs=400
#### Create and train model
[240]:
model_name = 'model_tweets_data_based_10000_2_v1_batch_size_250_20_latest_tweets_of_user_padded_tweets_text_and_additional_tweets_features_and_user_data'
model = create_model_2(num_words=num_words, embedding_dim=embedding_dim, embedding_matrix=embedding_matrix,
                       max_sequence_length=max_length, trainable=False,
                       tweets_text_shape=np.array(compact_train_tweets_text_data_padded[0]).shape,
                       add_tweets_feat_shape=np.array(compact_train_tweets_add_feat_data_padded[0]).shape,
                       user_data_shape=train_users_data.shape[1],
                       optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
[241]:
train_users_data = tf.convert_to_tensor(train_users_data, dtype=tf.float32)
val_users_data = tf.convert_to_tensor(val_users_data, dtype=tf.float32)
test_users_data = tf.convert_to_tensor(test_users_data, dtype=tf.float32)
[242]:
model = train_model(model, model_name,
                    train_X=[compact_train_tweets_text_data_padded, compact_train_tweets_add_feat_data_padded, p1],
                    train_Y=train_users_data_Y,
                    val_X=[compact_val_tweets_text_data_padded, compact_val_tweets_add_feat_data_padded, p2],
                    val_Y=val_users_data_Y,
                    batch_size=250, epochs=400, patience=100)

accuracy
    training   (min: 0.487, max: 0.550, cur: 0.525)
    validation (min: 0.497, max: 0.617, cur: 0.539)
Loss
    training   (min: 12365196165120.000, max: 28899256534302720.000, cur: 36342876602368.000)
    validation (min: 341581594624.000, max: 24356484657709056.000, cur: 43420508749824.000)
Epoch 283: val_accuracy did not improve from 0.61709
28/28 [==============================] - 2s 76ms/step - loss: 36342876602368.0000 - accuracy: 0.5255 - val_loss: 43420508749824.0000 - val_accuracy: 0.5390
#### Prediction and results
[243]:
prediction_and_metrics(model, [compact_test_tweets_text_data_padded, compact_test_tweets_add_feat_data_padded, test_users_data], test_users_data_Y)

Accuracy: 0.5904109589041096
Precision: [0.60060976 0.58208955]
Recall: 0.6410958904109589
F1 score: 0.610169
ROC AUC: 0.590411
#### Prediction and results on training set
[244]:
prediction_and_metrics(model, [compact_train_tweets_text_data_padded, compact_train_tweets_add_feat_data_padded, train_users_data], train_users_data_Y)

Accuracy: 0.6020731356176217
Precision: [0.61535958 0.59153111]
Recall: 0.659660236107112
F1 score: 0.623741
ROC AUC: 0.602073